Chapter 1: SDN overview
What is SDN - the history
Network device evolution
Since the early 1990s, network device manufacturers have innovated heavily to increase router speeds. They started from a router design in which everything was computed by a central CPU and evolved toward a distributed architecture in which the central CPU is less and less used because most actions are performed on "line cards".
This progress was made possible by proprietary TCAMs (Ternary Content-Addressable Memory) and ASICs (Application-Specific Integrated Circuits) designed to perform table lookups and data packet forwarding at high speed.
In the early 2000s, virtualization support on x86 computers drove a wave of innovation in the systems domain. Together, compute virtualization and the evolution of high-speed network devices enabled the creation of the cloud.
Later, it became clear that managing many isolated network devices, each with its own configuration language, was impractical. The following needs emerged:
-
Single point of configuration
-
Configuration protocol standardization
-
Network feature support on x86 servers
-
Extensibility and ability to scale
These needs called for the development of cloud and SDN technologies.
Early age of SDN
At Stanford University (CA, US), the Clean Slate Research Program was initiated to rethink how the Internet network architecture could be improved. The "Ethane" project was part of this program; its purpose was to "design a network where connectivity is governed by high-level, global policy". This project is generally regarded as the first implementation of SDN.
In 2008, a white paper published by the ACM (Association for Computing Machinery) proposed a new protocol, OpenFlow, that could program network devices from a network controller.
In 2011, the ONF (Open Networking Foundation) was created to promote the SDN architecture and the OpenFlow protocol.
SDN startups acquired by major network or virtualization vendors
The first companies working on SDN were founded around 2010, and most of them have since been acquired by major network or virtualization solution vendors. In 2007, Martin Casado, who had worked on the Ethane project, founded Nicira to provide network virtualization solutions based on SDN concepts. Nicira was acquired by VMware in 2012 to develop VMware NSX. In 2016, VMware also bought PLUMgrid, an SDN startup founded in 2013. Big Switch Networks, founded in 2010 to propose an SDN solution, was acquired by Arista Networks in early 2020. In 2012, Cisco created Insieme Networks, a spin-in startup working on SDN; in 2013, Cisco took back control of Insieme in order to develop its own SDN solution, called ACI (Application Centric Infrastructure). Contrail Systems Inc. was created in early 2012 and acquired by Juniper Networks at the end of that year. In 2013, Alcatel-Lucent created Nuage Networks, a spin-in startup working on SDN; Nuage Networks is now an affiliate of Nokia.
The road of SDN development has never been straightforward and is more nuanced than a single storyline might suggest; it is far too complex to be fully described in a short section. The diagram from [sdn-history] shows developments in programmable networking over the past 20 years and their chronological relationship to advances in network virtualization.
SDN definition
What is SDN?
The concept of SDN, and the term itself, are both very broad and often confusing. There is no truly accurate definition of SDN, and vendors usually interpret it very differently. Initially the term was used for Stanford's OpenFlow project, and it was later extended to cover a much wider range of technologies. A discussion of each vendor's exact SDN definition is beyond the scope of this book, but we generally consider that an SDN solution has to provide one or several of the following characteristics:
-
a network control and configuration plane split from the network dataplane.
-
a centralized configuration and control plane (SDN controller)
-
a simplified network node
-
network programmability to provide network automation
-
automatic provisioning (ZTP, zero-touch provisioning) of network nodes
-
virtualization support and openness
According to [onf-sdn-definition], Software-Defined Networking (SDN) is:
The physical separation of the network control plane from the forwarding plane, and where a control plane controls several devices
In this diagram, you can see that SDN allows simple high-level policies in the "application layer" to modify the network, because device-level dependency is eliminated to some extent. The network administrator can operate the different vendor-specific devices in the "infrastructure layer" from a single software console, the "control layer". The "controller" in the control layer is designed in such a way that it can view the whole network globally. This controller design makes it much easier to introduce functionality or programs, as they just need to talk to the centralized controller, without needing to know the details of communicating with each individual device. These details are hidden from the applications by the controller.
Several expectations are behind this new model:
-
openness: communication between the controller and network devices uses standardized protocols like REST, OpenFlow, XMPP, NetConf, etc. This reduces traditional vendor lock-in, giving you freedom of choice in networking.
-
cost reduction: because of this openness, you can pick whichever low-cost vendor you prefer for your infrastructure (hardware).
-
automation: the control layer has a global view of the whole network. With the APIs exposed by the control layer, it is much easier for applications to automate the network devices.
|
Note
|
In this diagram, "OpenFlow" is marked as the protocol between the control layer and the infrastructure layer. This is just one example of a standard communication protocol; as of today, more choices are available and standardized in the SDN industry, which will be covered later in this chapter. |
Traditional Network Planes and SDN layer
Traditionally, a typical network device (e.g. a router) has the following planes:
-
Configuration (and management) plane: used for network node configuration and supervision. Examples of widely used protocols and interfaces are the CLI (Command Line Interface), SNMP (Simple Network Management Protocol) and NetConf.
-
Control plane: used by network nodes to make packet forwarding decisions. Traditional networks run a wide range of different control protocols; common examples are OSPF, IS-IS, BGP, LDP, RSVP-TE, etc.
-
Forwarding (or data, or user) plane: responsible for data packet processing and forwarding. The forwarding plane is built from proprietary implementations specific to each network equipment vendor.
The configuration and control planes are located on the device's main processor card, often called the "routing engine" or "route switching engine". The forwarding plane is located on the device's packet forwarding cards, often called "line cards".
An SDN architecture is built with three layers:
-
Application Layer: containing all the applications provided by the SDN solution. Generally, a web GUI dashboard is the first application provided to SDN users. Other common applications are network infrastructure interconnection interfaces that allow the SDN solution to be plugged into a cloud infrastructure or a container orchestrator.
-
Control Layer: containing the SDN controller. This is the most intelligent part of an SDN solution. The SDN controller is made up of:
-
the SDN engine, made up of SDN Control Logic and databases.
-
"Southbound" interfaces that are used to control SDN network nodes. Most commonly used southbound interface protocols are OpenFlow, XMPP and OVSDB.
-
"Northbound" interfaces that are used to expose services provided by the infrastructure layer "upward" to the SDN applications. The most commonly used northbound interface protocol is HTTP/REST.
-
Infrastructure Layer: containing the SDN network nodes, which carry the actual traffic of an SDN solution. SDN network nodes can be either physical or virtual nodes. Typically, on each SDN node, there are:
-
an SDN agent, which handles the communication between the SDN network node and the SDN controller.
-
A flow/routing table built by the SDN Agent.
-
A forwarding plane engine
The primary changes between SDN and traditional networking
In a traditional infrastructure, route calculation is performed on each individual router. Each router runs one or several routing protocols, through which it exchanges routes with the other routers in the network; based on the route information learned, each router builds up enough knowledge about the network to make its forwarding decisions. From the network perspective, the control plane is distributed across the individual routers, and the end-to-end routing path is the result of all the decisions made by the control plane on each router.
The control plane on one router may look like this:
In reality, a simplified Juniper MX control plane typically looks like this:
Running a control plane on each router makes the network very hard to manage, because each individual device needs to be carefully configured, which requires extensive vendor-specific experience and skill. The high number of configuration points often makes it very challenging to build a robust network. Flexibility is also a recurring hurdle for traditional networks, since most routers run proprietary hardware and software.
In contrast, in SDN networking, the control and configuration functions are gathered into an "SDN controller" that controls the network devices. This architecture provides a completely new way to configure the network and brings:
-
simplified routers, without a complex control plane in each router.
-
a centralized control plane, which is a single configuration point
Let’s compare the two architectures:
This SDN infrastructure uses a centralized configuration and control point: route calculation is done centrally in the controller and distributed to each SDN network node. While the idea looks good and simple, it requires a few fundamental protocols and infrastructure pieces before this model can work:
-
a southbound network protocol, needed to exchange routing information between the SDN controller and each controlled element.
-
A "underlay" network: A network infrastructure is allowing the communication between SDN controller and SDN network nodes, and data packet transfer between SDN nodes.
This underlay network infrastructure is playing the same role that the local switch fabric is doing inside a standalone router between the control processor card and lines cards. Based on it, A "overlay" network can be built by the controller, which basically hides underlay network infrastructure details from the applications so they will focus on the high level service implementations. we’ll talk more about "underlay" and "overlay" in the next section.
convenient as it is, this makes the controller the weakest point in the whole model. Think of what will happen if this SDN controller, serving as the "brain", stops working. Everything will be frozen and nothing works as expected, or even worse, some part of the infrastructure continues to run but in an unexpected way, which will very likely trigger bigger issues to other part of the network.
Lots of efforts are done by each SDN solution supplier to solve this weakness. A common and efficient practice is to use clustered architecture to build a highly resilient controller cluster. e.g 3 SDN controllers can load balance and/or backup each other. on failure of one or two, the other one can still make the whole cluster survive, giving the operator longer maintanence windows to fix the problem.
Underlay vs overlay
In an SDN architecture, each network node is connected to a physical network infrastructure. This physical network, which provides basic connectivity between the network nodes, is called the "underlay" network. It is sometimes also called the "fabric", and typically it is a plain L3 IP network.
Very often the network needs to separate different administrative domains (often called "tenants"): switch within the same L2 broadcast domain, route between L2 broadcast domains, provide IP separation via VRFs, etc. This is implemented in the form of "overlay" networks. The overlay network is a logical network that runs on top of the underlay network; it is formed of tunnels that carry the traffic across the L3 fabric.
Today the industry has begun to shift toward building L3 data centers and L3 infrastructures, mostly because of the rich features that come with L3 technologies, e.g. ECMP load balancing, flooding control, etc. But L2 traffic has not disappeared, and most likely it never will. There is always a desire for a group of network users to reside in the same L2 network - typically a VLAN. However, in today's virtualization environments, a user's VM can be spawned on any compute node located anywhere in the L3 cluster. Even if two VMs are spawned on the same server, there is often a need to move them between different servers without changing their networking attributes. The requirement that a VM always belong to the "same VLAN" calls for an overlay model over the L3 network. In other words, we need a mechanism that allows us to tunnel L2 Ethernet domains, using various encapsulations, over an L3 network.
For example, suppose SDN node1 runs VM11 and VM12, both serving the same sales department and therefore located in the same VLAN. Because of some administrative requirement, VM12 needs to be moved to another physical node, SDN node2, which may be located in another rack a few router "hops" away. Now we need to ensure not only that data packets from VM11 on SDN node1 can reach VM12 on SDN node2, but also that the two VMs keep talking to each other as if they were still in the same VLAN, exactly as if VM12 had never moved. This ability to carry "local" (same-VLAN) traffic transparently across the underlay network infrastructure calls for a packet encapsulation, or "tunneling", mechanism in SDN networks.
Indeed, without such an encapsulation mechanism, traditional segmentation solutions (VLAN, VRF) would have to be provided by the physical infrastructure and implemented up to each SDN node in order to provide an isolated transport channel for each customer network connected to the SDN infrastructure.
Encapsulation protocols used in SDN networks have to provide:
-
network segmentation: the ability to build several distinct network connections between two SDN network nodes.
-
the ability to transparently carry Ethernet frames and IP packets
-
the ability to be carried over plain IP connectivity
Several encapsulation protocols are used in SDN networks:
-
VxLAN
-
MPLS over GRE
-
MPLS over UDP
-
NVGRE
-
Geneve
-
STT
These encapsulation protocols provide the overlay connectivity required between the customer workloads connected to the SDN infrastructure. Each SDN node is called a VTEP (Virtual Tunnel End Point) as it originates and terminates the overlay tunnels.
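To make the tunneling idea concrete, here is a small illustrative C sketch of the VXLAN header layout (per RFC 7348; the other encapsulations differ in detail but follow the same principle). A VTEP prepends an outer Ethernet/IP/UDP header (UDP destination port 4789) plus this 8-byte header to the original Ethernet frame, and the 24-bit VNI acts as the segment (tenant) identifier:

[source,c]
----
#include <stdint.h>

/* VXLAN header (RFC 7348): 8 bytes sitting between the outer UDP header
 * and the inner, original Ethernet frame. Full encapsulation:
 * outer Ethernet / outer IP / outer UDP (dst port 4789) / VXLAN / inner frame. */
struct vxlan_hdr {
    uint8_t flags;        /* 0x08 when a valid VNI is carried */
    uint8_t reserved1[3];
    uint8_t vni[3];       /* 24-bit VXLAN Network Identifier: the segment ID */
    uint8_t reserved2;
};
----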
Interfaces between layers
We've seen "OpenFlow" marked as one of the possible interfaces in the "SDN layer" section. Now we'll introduce the concepts of the "southbound" and "northbound" interfaces and the other choices available in today's industry.
The "southbound" interface resides between the controller in "control layer" and
network devices in "infrastructure layer". Basically what it does is to provide
a means of communication between the 2 layers. Based on the demands and needs, a
SDN Controller will dynamically changes the configuration or routing information
of network devices. For example, a new VM will advertise a new subnet or host
routes when it is spawned in a server, this advertisement will be delivered to
SDN controller via a southbound protocol. Accordingly, SDN controller collects
all routing updates from the whole SDN cluster through the southbound
interfaces, and decides the most current and best route entries, then, it may
"reflect" these information to all other network devices or VMs. This ensures
all devices having the most uptodate routing information in real time. Among
others, examples of the most well-known southbound interfaces in the industry
are openflow, OVSDB and XMPP.
OpenFlow is a protocol that sends flow information to the virtual switch so the switch can forward packets between its different ports. Flows are defined based on criteria such as traffic between a source MAC address and a destination MAC address, source and destination IP addresses, TCP ports, VLANs, tunnels, and so on.
OpenFlow is one of the most widely deployed southbound standards from the open source community. It first appeared in 2008, coming out of the Stanford University work in which Martin Casado was closely involved. The appearance of OpenFlow was one of the main factors that gave birth to Software-Defined Networking.
OpenFlow provides various kinds of information to the controller. It generates event-based messages in case of port or link changes, and it produces flow-based statistics for the forwarding network device and passes them to the controller.
OpenFlow also provides a rich set of protocol specifications for effective communication on both the controller and switching-element sides, and it gives the research community an open platform to build on.
Every physical or virtual OpenFlow-enabled network (data plane) device in the SDN domain first needs to register with the OpenFlow controller. The registration process is completed via an OpenFlow HELLO packet sent from the OpenFlow device to the SDN controller.
OVSDB is an abbreviation for "Open vSwitch Database". Unlike OpenFlow, OVSDB is a southbound API designed to provide additional management and configuration capabilities, such as networking functions. With OVSDB we can create virtual switch instances, set up interfaces and connect them to switches, and apply QoS policies to interfaces. OVSDB is a JSON-based (JavaScript Object Notation) protocol that sends and receives commands via JSON-RPC.
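To give a feel for the JSON-RPC exchange, here is a minimal C sketch that sends the standard "list_dbs" request (defined in RFC 7047) to a local ovsdb-server and prints the raw reply. The socket path is an assumption: it varies by distribution and build options:

[source,c]
----
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

int main(void) {
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    /* Common default path; adjust for your installation. */
    strncpy(addr.sun_path, "/var/run/openvswitch/db.sock",
            sizeof(addr.sun_path) - 1);
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");
        return 1;
    }

    /* "list_dbs" is defined by RFC 7047; the reply lists the databases
     * (typically ["Open_vSwitch"]) held by this ovsdb-server. */
    const char *req = "{\"method\":\"list_dbs\",\"params\":[],\"id\":0}";
    write(fd, req, strlen(req));

    char buf[4096];
    ssize_t n = read(fd, buf, sizeof(buf) - 1);
    if (n > 0) {
        buf[n] = '\0';
        printf("%s\n", buf);   /* raw JSON-RPC response */
    }
    close(fd);
    return 0;
}
----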
The northbound interface provides connectivity between the controller and the network applications running in the management plane. While the southbound side has an open protocol standard in OpenFlow, the northbound side lacks that kind of protocol standardization. However, a wide range of northbound APIs is now available, such as ad-hoc APIs, RESTful APIs, etc. The selection of a northbound interface usually depends on the programming language used for application development.
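As an illustration, a northbound REST call is usually just an authenticated HTTP request. The sketch below uses libcurl with a purely hypothetical controller URL and credentials; every controller defines its own endpoints and authentication scheme:

[source,c]
----
#include <stdio.h>
#include <curl/curl.h>   /* build with: gcc rest.c -lcurl */

int main(void) {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl)
        return 1;

    /* Hypothetical endpoint exposing the controller's topology view. */
    curl_easy_setopt(curl, CURLOPT_URL, "http://controller:8181/restconf/topology");
    curl_easy_setopt(curl, CURLOPT_HTTPAUTH, (long)CURLAUTH_BASIC);
    curl_easy_setopt(curl, CURLOPT_USERPWD, "admin:admin");

    /* By default libcurl writes the response body to stdout. */
    CURLcode res = curl_easy_perform(curl);
    if (res != CURLE_OK)
        fprintf(stderr, "request failed: %s\n", curl_easy_strerror(res));

    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return 0;
}
----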
More alphabet soup of terms
With the development of virtualization, SDN technologies and their ecosystems in recent years, more and more terms (and changes to those terms) have emerged in the networking industry. A lot of confusion has arisen, often because terms refer to different things when used in different contexts. Sometimes the latest term the industry uses is a particular technology such as VNF, or a concept such as NFV. Terms rise and fall out of favor as the industry evolves. In recent years, terms such as OpenStack and NFV/VNF have become the industry's favorite buzzwords. This raises the question: just what are OpenStack and NFV/VNF, and what are their relationships with SDN?
NFV and VNF sound like new buzzwords, but those technologies have been around for years. According to ETSI:

NFV means "network function virtualization"; it stands for an "operational framework for orchestrating and automating VNFs". VNF means "virtualized network function": virtualized routers, firewalls, load balancers, traffic optimizers, IDS or IPS, web application protectors, and so on.
In a nutshell, you can think of NFV as a "concept" or "framework" for virtualizing certain network functions, while a VNF is the implementation of an individual network function.
Among others, firewalls and load balancers are the two most common VNFs in the industry, especially for deployments inside data centers. When you read today's documents about virtualization technology, you will often see terms following a pattern like "vXX" (e.g. vSRX, vMX) or "cXX" (e.g. cSRX). The letter "v" indicates a "virtualized" product, while the letter "c" indicates its "containerized" version.
Jointly launched by NASA and Rackspace in 2010, OpenStack has rapidly gained popularity in many enterprise data centers. It is one of the most widely used open source cloud computing platforms for supporting software development and big data analytics. OpenStack comprises a set of software modules, e.g. compute, storage and networking modules, which work together to provide an open source choice for building private and public cloud environments. As an open source IaaS (Infrastructure as a Service) implementation, it provides a wide range of services, from basic ones like compute, storage and networking, to advanced services like databases, container orchestration and others.
You can think of OpenStack as an abstraction layer providing a cloud environment on premises. With OpenStack installed on your servers, you can spawn a VM, consume it, and recycle it when you are done, all in seconds. Under that abstraction layer, OpenStack hides most of the complexity of automating and orchestrating diverse underlying resources such as compute, storage and networking. You can choose servers, storage and networking devices from your favorite vendors to build the underlying infrastructure, and OpenStack will "consume" all of them and expose them to the user as a pool of common "resources": numbers of CPUs, RAM, disk space, IP addresses, etc. The user does not (need to) care about vendor and brand details.
If we compare OpenStack with SDN, it is not hard to see that the two models share some common features. Both provide a certain level of abstraction, hiding the low-level hardware details and exposing resources to upper-level user applications. The differences are somewhat subtle to describe in just a few words. First, although there are various OpenStack distributions from different vendors, they share common core components that are managed by the OpenStack Foundation, while SDN is more of a "framework" or an "approach" to managing the network dynamically, which can be implemented with totally different software techniques. Second, from the perspective of ecosystem coverage, OpenStack is much broader, because networking is just one of its services, implemented by its Neutron component among its various other modules; SDN and its ecosystem, in contrast, mainly focus on networking. There are also differences between the way Neutron works and the way a typical SDN controller works. OpenStack Neutron focuses on providing network services for virtual machines, containers, physical servers, etc., and exposes a unified northbound REST API to users. SDN focuses on the configuration and management of forwarding control toward the underlying network devices; it not only provides user-oriented northbound APIs, but also standard southbound APIs to communicate with various hardware devices.
|
Note
|
The comparison between OpenStack and SDN here is conceptual. In reality the two models can be, and in fact often are, coupled with each other in some way, loosely or tightly. One example is TF, which we'll talk about later in this chapter. |
SDN solutions
Controllers
As we've mentioned in previous sections, SDN is a networking paradigm that changes the traditional network architecture by bringing the control functions to a single location and making decisions centrally. SDN controllers are the brain of the SDN architecture: they perform the control and decision tasks involved in routing packets, and this centralized decision-making can enhance network performance. As a result, the SDN controller is the core component of any SDN solution.
When working with an SDN architecture, one of the major points of concern is which controller and solution should be selected for deployment. There are quite a few SDN controller and solution implementations from various vendors, and every solution has its own pros and cons, along with its own working domain. In this section we'll review some of the popular SDN controllers on the market, and the corresponding SDN solutions.
OpenDaylight (ODL)
OpenDaylight, often abbreviated as ODL, is a Java-based open source project started in 2013. It was originally led by IBM and Cisco but is now hosted by the Linux Foundation. It was the first open source controller to support non-OpenFlow southbound protocols, which makes it much easier to integrate with multiple vendors.
ODL is a modular platform for SDN. It is not a single piece of software; it is a platform for integrating multiple plugins and modules under one umbrella. There are many plugins and modules built for OpenDaylight: some are in production, while others are still under development.
Some of the initial SDN controllers had their southbound APIs tightly bound to OpenFlow, but as we can see from the diagram, besides OpenFlow, many other southbound protocols available on today's market are also supported. Examples are NETCONF, OVSDB, SNMP, BGP, etc. Support for these protocols is implemented in a modular fashion in the form of different plugins, which are linked dynamically to a central component named the "Service Abstraction Layer" (SAL). The SAL translates between the SDN application and the underlying network equipment. For instance, when it receives a service request from an SDN application, typically via high-level (northbound) API calls, it interprets the API call and translates the request into a language that the underlying network equipment can understand. That language is one of the southbound protocols.
While this "translation" is transparent to the SDN application, ODL itself needs
to know all the details about how to talk to each one of the network devices it
supports, their features, capabilities etc. a topology manager module in OLD
manages this type of information. What topology manager does is to collect
topology related information from various modules and protocols, such as ARP,
host tracker, device manager, switch manager, OpenFlow, etc, and based on these
info, it visualize the network topology by drawing a diagram dynamically, all
the managed devices and how they are connected together will be showed in it.
any topology changes, such as adding new devices, will be updated in the database and reflected immediately in the diagram.
Remember earlier we mentioned that an SDN controller has "global view" of the whole SDN network. In that sense ODL has all necessary visibility and knowledge of the network that can be used to draw the network diagram in realtime.
OVS[ovs]
Open vSwitch (OVS) introduction
OVS is one of the most popular, "production quality" open source implementations of a multilayer virtual switch. OVS was created back in 2009 by Nicira, which was later acquired by VMware. It is licensed under the Apache 2.0 license and hosted by the Linux Foundation. The virtual switch does most of the jobs you would expect a physical switch to do, but in software. OVS typically runs with Linux hypervisors like KVM and can be loaded into a Linux kernel. OVS supports most features found in traditional physical switches, such as:
-
802.1Q and VLAN
-
BFD
-
NetFlow/sFlow
-
port mirroring
-
LACP
-
VXLAN
-
GENEVE and GRE overlays
-
STP
-
IPv6
Besides the functions of a traditional switch, the bigger advantage of OVS is its native support for SDN solutions via the OVSDB and OpenFlow protocols. That means any SDN controller can integrate with OVS via these two open standard protocols. Therefore, OVS can work either as a standalone L2 switch within a hypervisor host, or it can be managed and programmed by an SDN controller such as ODL. That is why it is used in so many open source and commercial virtualization projects.
The OVS architecture
Open vSwitch introduced an architecture in which an SDN controller configures and manages virtual switches via the OVSDB protocol and pushes flows into the switches via the OpenFlow protocol. In general, OVS comprises the following components:
-
an ovsdb-server database
-
an ovs-vswitchd daemon
-
a kernel module
The architecture is described in this figure:
The ovsdb-server is a configuration database that controls and stores the switch-level configuration. It contains information on creating bridges, attaching interfaces, attaching tunnels, and so on. These objects are organized as a set of different tables that point to each other in a certain sequence:
-
OVS table
-
bridge table
-
port table
-
interface table
For example, an entry in the top-level OVS table points to a bridge table, which has items pointing to a port table, which in turn points to an interface table. The stateful database makes sure the system can recover to the exact state it was in, in case of a reboot. The ovsdb-server talks to the outside controller via the OVSDB protocol.
The ovs-vswitchd daemon is the heart of OVS and where flow processing happens. ovs-vswitchd has all the information (e.g. bridges, flow tables, etc.) needed to forward packets, and it has different interfaces to the other components. Inside the hypervisor, it connects to ovsdb-server via the OVSDB protocol and to the kernel module via a Linux Netlink interface. Toward the outside controller, it runs the OpenFlow protocol to exchange flow information.
OVS workflow
ovs-vswitchd pushes flows to the kernel module for fast forwarding. When the first packet of a flow arrives, it goes through the kernel module, where the headers are hashed to find a flow entry. If no flow entry is found, the packet goes up to ovs-vswitchd for normal processing, and ovs-vswitchd then pushes the resulting flow down to be cached inside the kernel module. If a similar packet comes in later, it is forwarded via the fast path inside the kernel module. The kernel module does not contain any of the OpenFlow tables known to ovs-vswitchd; rather, it contains the result of the different lookups in the flow tables. The kernel module also handles the tunneling of packets via protocols such as GRE, VXLAN, and others.
Calico
Calico introduction
Quoting from the Calico official website:
Calico is an open source networking and network security solution for containers, virtual machines, and native host-based workloads. Calico supports a broad range of platforms including Kubernetes, OpenShift, Docker EE, OpenStack, and bare metal services.
Calico has been an open source project from day one. It was designed for today's modern cloud-native world and runs on both public and private clouds. Its reputation mostly comes from its deployment in Kubernetes and its ecosystem. Today Calico has become one of the most popular Kubernetes CNIs, with many enterprises using it at scale.
Compared with overlay-based SDN solutions, Calico is special in the sense that it does not use any overlay networking design or tunneling protocols, nor does it require NAT. Instead, it uses a plain IP networking fabric to enable host-to-host and pod-to-pod networking. The basic idea is to provide Layer 3 networking capabilities and associate a virtual router with each node, so that each node behaves like a traditional router, or "virtual router". A typical Internet router relies on routing protocols like OSPF and BGP to learn and advertise routing information, and that is exactly how a node in a Calico network works. Calico chooses BGP because it is simple, it is the industry's current best practice, and it is the protocol that scales best.
Calico uses a policy engine to deliver high-level network policy management.
Calico architecture
Calico is made up of the following components:
-
Felix: the primary Calico agent that runs on each machine that hosts endpoints.
-
The Orchestrator plugin: orchestrator-specific code that tightly integrates Calico into that orchestrator.
-
BIRD: a BGP speaker that advertises and installs routing information.
-
BGP Route Reflector (BIRD): an optional BGP route reflector for higher scale.
-
Calico CNI plugin: connects the containers to the host.
-
IPAM: for IP address allocation management
-
etcd: the data store.
Felix is the Calico "agent": a daemon that runs on every machine that hosts workloads, for example on nodes that host containers or VMs. It performs most of the "magic" in the Calico stack. It is responsible for programming routes and ACLs, and anything else required on the host, in order to provide the desired connectivity for the endpoints on that host.
Depending on the specific orchestrator environment, Felix is responsible for the following tasks:
-
Interface management (ARP response)
-
Route programming (linux kernel FIB)
-
ACL programming (host IPtables)
-
State reporting (health check)
It does all this by connecting to etcd and reading information from there. It
runs inside the calico/node DaemonSet along with confd and BIRD.
The orchestrator plugins are essentially responsible for API translation. Calico has a separate plugin for each major cloud orchestration platform (e.g. OpenStack, Kubernetes).
For example, in an OpenStack environment, a Calico Neutron ML2 driver integrates with Neutron's ML2 plugin to allow users to configure the Calico network simply by making Neutron API calls. This provides seamless integration with Neutron.
etcd is the backend data store for all the information Calico needs. It can be the same etcd that Kubernetes uses, or a different one.
It holds at least, but is not limited to, the following information:
* the list of all workloads (endpoints)
* the BGP configuration
* the policies from the user (e.g. defined via the calicoctl tool)
* information about each container (pod name, IP, etc.), received from the Calico CNI plugin
Calico makes use of BGP to propagate routes between hosts. The BGP "speaker" in Calico is BIRD: a routing daemon that runs on every host that also runs the Felix module in the Kubernetes cluster, usually as a DaemonSet. It is included in the calico/node container. Its role is to read the routing state that Felix programs into the kernel and distribute it around the data center. Compared with Felix, the main difference is that Felix "inserts" routes into the Linux kernel FIB while BIRD "distributes" them to all the other nodes in the deployment. This turns each host into a virtual Internet BGP router ("vRouter") and ensures that traffic is efficiently routed around the deployment.
confd is a simple configuration management tool. In Calico, BIRD does not deal with etcd directly; instead, the confd module reads the BGP configuration from etcd and feeds it to BIRD in the form of configuration files on disk.
CNI stands for "container networking interface". The Calico CNI plugin configures the IP addresses and routes for pods: there is an interface for each pod, and when the container is spun up, Calico (via CNI) creates that interface and assigns it to the pod.
When a new pod starts up, Calico will:
-
query the Kubernetes API to determine that the pod exists and that it is on this node
-
assign the pod an IP address from within its IPAM
-
create an interface on the host so that the container can get an address
-
tell the Kubernetes API about this new IP
As the name indicates, Calico's IPAM plugin is responsible for "IP address management". When a new container is spawned, the Calico IPAM plugin reads information from the etcd database to decide which IP is available to be allocated to the container. By default, IP addresses are allocated in units of a /26 "block". A block is essentially a subnet (a /26 covers 64 addresses) that aggregates routes to save routing table space.
Calico workflow
-
A container is spawned
-
The Calico IPAM plugin assigns an IP address from an IP block (by default a /26), then records this in etcd.
-
The Calico CNI plugin applies the network configuration to the container so that it has a default route pointing to the host. CNI also saves this information to etcd.
-
Felix applies the network configuration to the host, so that the host is aware of the new container and ready to receive packets from it.
-
confd reads the data from etcd and generates the routing configuration; BIRD uses this configuration to establish BGP neighborships with other nodes, then advertises the container subnet to the rest of the cluster via BGP.
-
All the other hosts in the cluster learn this subnet via BGP and install the route into their local routing tables; the new container is now reachable from anywhere in the cluster.
-
The user may configure a routing policy, e.g. via calicoctl commands. The policy is saved in the etcd database; Felix reads this policy and applies it to the firewall configuration.
VCP (Nuage)
VCP introduction
The Virtualized Cloud Platform (VCP) was created by Nuage Networks. It provides a "policy-based" SDN platform that has a data plane built on top of the open source OVS, and an SDN controller built on open standards.
The Nuage platform uses overlays to provide policy-based networking between different cloud environments (Kubernetes pods, or non-Kubernetes environments such as VMs and bare metal servers). It also has a real-time analytics engine for monitoring Kubernetes applications.
All components can be installed in containers. There are no special hardware requirements.
VCP architecture
-
virtualized services directory (VSD)
-
virtualized services controller (VSC)
-
virtualized routing and switching (VRS)

In Nuage VCP, the Virtualized Services Directory (VSD) is a policy, business logic and analytics engine that supports the abstract definition of network services. Through RESTful APIs to the VSD, administrators can define and refine service designs and incorporate enterprise policies.
It is a web-based, graphical console that connects to all of the VRS nodes in the network to manage their deployment and configuration.
The VSD policy and analytics engine presents a unified web interface where configuration and monitoring data are presented. The VSD is API-enabled for integration with other orchestration tools; alternatively, you can develop your own apps. Either way, the VSD is based on tools from the service provider world, and therefore its scaling potential looks very good. It integrates multiple data centre networks by linking VSDs together and exchanging policy data.
The Nuage Virtualized Services Controller (VSC) works between the VSD and the VRS. Policies from the VSD are distributed through a number of VSCs to all of the VRS nodes in the network to manage their deployment and configuration.
The VSC is the SDN controller in the Nuage VCP architecture. It provides a robust control plane for the data center network, maintaining a full per-tenant view of the network and service topologies. Through network APIs that use southbound interfaces (e.g. OpenFlow), the VSC programs the data center network independently of the underlying hardware.
The VSC implements an OSPF, IS-IS or BGP listener to monitor the state of the physical network. Therefore, if routes start flapping, the VSC is able to incorporate those events into its decision tree.
While scalability in a single data center can be achieved by setting up multiple VSCs, each handling a certain group of VRS devices, scalability across multiple data centres can be achieved by connecting VSC controllers horizontally at the top of the hierarchy.
As shown in the diagram above, VSC controllers are synchronised using MP-BGP. A BGP connection peers with PE routers at the WAN edge, and then the VSC controller uses MP-BGP to synchronise controller state & configuration with VSCs in other data centres. This is vital for end-to-end network stability.
When dVRS devices communicate with non-local dVRS devices, data is tunnelled in MPLS-over-GRE to the PE router.
The VRS module serves as a virtual endpoint for network services. It detects changes in the compute environment as they occur and instantaneously triggers policy-based responses to ensure that the network connectivity needs of applications are met.
The configuration of the VRS is derived from a series of templates.
Each VRS routes traffic into the network according to its flow table. Therefore, the entire VRS system performs routing at the edge of the network.
A VRS can’t make a forwarding decision in a vacuum, as events in the underlying physical network must be considered. Nuage Networks has extensively considered how to provide the VSC controller with all the information required to have a complete model of the network.
VCP workflow
Overview of Tungsten Fabric (TF)
TF introduction
Tungsten Fabric (TF) is an open-standards-based, proactive overlay SDN solution. It works with existing physical network devices and helps address the networking challenges of self-service, automated, and vertically integrated cloud architectures. It also improves scalability through a proactive overlay virtual network technique.
The TF controller integrates with most of the popular cloud management systems, such as OpenStack, VMware, and Kubernetes. TF's focus is to provide network connectivity and functions, and to enforce user-defined network and security policies for the various workloads running on different platforms and orchestrators.
Tungsten Fabric’s primary claim to fame is that it is diligently multi-cloud and multi-stack. Today it supports:
-
Multiple compute types: baremetal, VMs and containers
-
Multiple cloud stack types: VMware, OpenStack, Kubernetes (via CNI), OpenShift
-
Multiple performance modes: kernel native, DPDK accelerated, and several different SmartNICs
-
Multiple overlay models: MPLS tunnels or direct, non-overlay mode (no tunneling)
TF fits seamlessly into the LFN (Linux Foundation Networking) mission to foster open source innovation in the networking space.
The TF system is implemented as a set of nodes running on general-purpose x86 servers. Each node can be implemented as a separate physical server or as a VM.
Initially, "Contrail" was a product of a startup company, Contrail Systems, which was acquired by Juniper Networks in December 2012. It was open sourced in 2013 under a new name, "OpenContrail", under the Apache 2.0 license, which means that anyone can use and modify the code of the OpenContrail system without any obligation to publish or release the modifications. In early 2018, it was rebranded to "Tungsten Fabric" (abbreviated as "TF") as it transitioned into a fully fledged Linux Foundation project. TF is currently still managed by the Linux Foundation.
Juniper also maintains a commercial version of the Contrail system and provides commercial support to paying customers. Both the open source and the commercial versions of the Contrail system provide the same full set of functionality, features and performance.
|
Note
|
Throughout this book, we use the terms "Contrail", "OpenContrail", "Tungsten Fabric" and "TF" interchangeably. |
TF architecture
TF consists of two main components:
-
Tungsten Fabric Controller: the SDN controller in the SDN architecture.
-
Tungsten Fabric vRouter: a forwarding plane that runs on each compute node, performing packet forwarding and enforcing network and security policies.
The communication between the controller and vRouters is via XMPP, which is a widely used messaging protocol.
A high level Tungsten Fabric architecture is shown below:
The TF SDN controller node
The TF SDN controller integrates with an orchestrator’s networking module in the form of a "plugin", for instance:
-
in an OpenStack environment, TF interfaces with the Neutron server as a Neutron plugin
-
in a Kubernetes environment, TF interfaces with the k8s API server via a kube-network-manager process and a CNI plugin that watch events from the k8s API.
The TF SDN controller is a so-called "logically centralized" but "physically distributed" SDN controller. It is "physically distributed" because identical controllers can run on multiple (typically three) nodes in a cluster. At the same time, all the controllers work together and behave consistently as a single logical unit that provides the management, control, and analytics functions of the whole cluster.
This "physically distributed" nature of the Contrail SDN controller is a distinguishing feature, because there can be multiple redundant instances of the controller operating in an "active/active" mode (as opposed to an "active-standby" mode). When everything works, the controllers share the workload and load-balance the control tasks. When a node becomes overloaded, additional instances of that node type can be instantiated, after which the load is automatically redistributed. On the failure of any active node, the system as a whole continues to operate without interruption. This prevents any single node from becoming a bottleneck and allows the system to manage very large-scale deployments. In production, a typical high-availability (HA) deployment runs three controller nodes in active-active mode, eliminating any single point of failure.
Like any SDN controller, the TF controller has a "global view" of all routes in the cluster. It implements this by collecting route information from all compute nodes (where the TF vRouters reside) and distributing this information throughout the cluster.
TF vRouter: compute node
Compute nodes are general-purpose virtualized servers that host VMs. These VMs can be tenants running general applications, or service VMs running network services such as a virtual load balancer or virtual firewall. Each compute node contains a TF vRouter that implements the forwarding plane.
The TF vRouter is conceptually similar to other existing virtualized switches such as Open vSwitch (OVS), but it also provides routing and higher-layer services. It replaces the traditional Linux bridge and iptables, or Open vSwitch networking, on the compute hosts. Configured by the TF controller, the TF vRouter implements the desired networking and security policies: while workloads in the same network can communicate with each other "by default", an explicit network policy is required to communicate with VMs in different networks.
As in other overlay SDN solutions, the TF vRouter extends the network from the physical routers and switches in the data center into a virtual overlay network hosted on the virtualized servers. Overlay tunnels are established between all compute nodes; communication between VMs on different nodes is carried in these tunnels and behaves as if the VMs were on the same compute node. Currently, VXLAN, MPLSoUDP and MPLSoGRE tunnels are supported.
TF controller components
In each TF SDN Controller there are three main components:
-
Configuration nodes keep a persistent copy of the intended configuration state and store it in a Cassandra database. They are also responsible for translating the high-level data model into a lower-level form suitable for interacting with control nodes.
-
Control nodes are responsible for propagating the low-level state received from the configuration nodes to the network devices and peer systems in an eventually consistent way. They implement a logically centralized control plane that is responsible for maintaining network state. Control nodes run XMPP toward the network devices, and run BGP with each other.
-
Analytics nodes are mostly about statistics and logging. They are responsible for capturing real-time data from network elements, abstracting it, and presenting it in a form suitable for applications to consume; they collect, store, correlate, and analyze this information from the network elements.
TF vRouter components
The TF vRouter runs on each compute node. The compute node is a general-purpose x86 server that hosts tenant VMs running customer applications.
The TF vRouter consists of two components:
-
the vRouter agent: the local control plane.
-
the vRouter forwarding plane
|
Note
|
In a typical configuration, Linux is the host OS and KVM is the hypervisor. The Contrail vRouter forwarding plane can sit either in Linux kernel space or in user space in DPDK mode. More details will be covered in later chapters. |
The vRouter agent is a user space process running inside Linux. It acts as the local, lightweight control plane on the compute node, in a way similar to what the "routing engine" does in a physical router. For example, the vRouter agent establishes XMPP neighborships with two controller nodes and exchanges routing information with them. The vRouter agent also dynamically generates flow entries and injects them into the vRouter forwarding plane; this instructs the vRouter how to forward packets.
The vRouter forwarding plane works like a "line card" of a traditional router: it looks up its local FIB to determine the next hop of a packet. It also encapsulates packets before sending them to the overlay network and decapsulates packets received from the overlay network.
We'll cover more details of the TF vRouter in later chapters.
TF workflow
Chapter 2: SDN dataplane fundamentals
Virtualization concepts
Server virtualization
Kernel-based Virtual Machine (KVM) is an open source virtualization technology built into Linux. It provides hardware assistance to the virtualization software, using built-in CPU virtualization technology to reduce virtualization overhead (cache, I/O, memory) and improve security.
QEMU is a hosted virtual machine emulator that provides a set of different hardware and device models for the guest machine. To the host, QEMU appears as a regular process scheduled by the standard Linux scheduler, with its own process memory. As it runs, QEMU allocates a memory region that the guest sees as physical memory, and it executes the virtual machine's CPU instructions.
With KVM, QEMU can create a virtual machine with virtual CPUs (vCPUs) that the processor is aware of, running instructions at native speed. When a special instruction is reached, like one that interacts with a device or with a special memory region, the vCPU pauses and KVM informs QEMU of the cause of the pause, allowing the hypervisor to react to that event.
Libvirt is an open source toolkit for managing virtualization platforms. It is a collection of software that allows you to manage virtual machines and other virtualization functionality, such as storage and network interface management. Libvirt lets you define virtual components in XML-formatted configurations, which can then be translated into a QEMU command line.
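As a small illustration (the bridge name and MAC address are hypothetical), the following is the kind of libvirt XML fragment used to define a guest network interface; libvirt translates such definitions into the corresponding QEMU command-line options:

[source,xml]
----
<interface type='bridge'>
  <!-- host bridge the virtual NIC attaches to -->
  <source bridge='br0'/>
  <!-- paravirtualized virtio-net device model -->
  <model type='virtio'/>
  <mac address='52:54:00:12:34:56'/>
</interface>
----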
Inter Process Communication
Inter-process communication (IPC) is a mechanism that allows processes to communicate with each other and synchronize their actions. The communication between these processes can be seen as a method of cooperation between them.
IPC is used in network virtualization to exchange data between different distributed processes of the same application (for example, the virtio frontend and backend, or the Contrail vRouter agent and dataplane) or between processes of distinct applications (e.g. the Contrail vRouter and QEMU virtio, virtio and VFIO, and so on).
Two different modes of communication are used for IPC:
-
Shared Memory: processes read and write information in a shared memory region.
-
Message Passing: processes establish a communication link that is used to exchange messages.
Shared Memory
The following scenario applies when shared memory is used for IPC:
-
First, a shared memory area is defined (shmget) with a key identifier known by the processes involved in the communication.
-
Second, processes attach (shmat) to the shared memory and retrieve a memory pointer.
-
Then, processes read or write information in the shared memory using the shared memory pointer (read/write operations).
-
Next, processes detach from the shared memory (shmdt).
-
Last, the shared memory area is freed (shmctl).
The following system calls are used in shared memory IPC (a minimal C sketch follows the list):
-
shmget: creates a shared memory segment, or uses an already created one.
-
shmat: attaches the process to the already created shared memory segment.
-
shmdt: detaches the process from the attached shared memory segment.
-
shmctl: performs control operations on the shared memory segment (set permissions, collect information).
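The following minimal C sketch (writer side only, error handling trimmed for brevity) walks through the scenario above: create, attach, write, detach, and finally free the segment:

[source,c]
----
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void) {
    /* Key derived from a path and a project id, known by both processes. */
    key_t key = ftok("/tmp", 42);

    /* Create (or get) a 4 KiB shared memory segment. */
    int shmid = shmget(key, 4096, IPC_CREAT | 0666);
    if (shmid < 0) { perror("shmget"); return 1; }

    /* Attach the segment and retrieve a memory pointer. */
    char *mem = shmat(shmid, NULL, 0);
    if (mem == (void *)-1) { perror("shmat"); return 1; }

    /* Plain read/write operations through the pointer. */
    strcpy(mem, "hello from the writer process");

    /* Detach from the segment. */
    shmdt(mem);

    /* A reader would shmget() the same key, shmat(), read, and shmdt().
     * When done, one process frees the segment: */
    shmctl(shmid, IPC_RMID, NULL);
    return 0;
}
----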
Message passing
Several message-passing methods are available to exchange data between processes (a minimal example follows the list):
-
eventfd: a system call that creates an "eventfd object" (a 64-bit integer counter). It can be used as an event wait/notify mechanism by user-space applications, and by the kernel to notify user-space applications of events.
-
pipes (and named pipes): unidirectional data channels. Data written to the write end of the pipe is buffered by the operating system until it is read from the read end of the pipe.
-
Unix Domain Sockets: domain sockets use the file system as their address space. Processes reference a domain socket as an inode, and multiple processes can communicate using the same socket. The server side of the communication binds a Unix socket to a path in the file system, so a client can connect to it using that path.
There are some other mechanisms that can be used by processes to exchange messages (shared files, message queues, network sockets, and signals) which are not described in this document.
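As a small example of message passing, the C sketch below uses an eventfd object so that a forked child process can notify its parent; the same wait/notify pattern appears between hypervisor components (e.g. vhost and KVM use eventfd-based notification):

[source,c]
----
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/eventfd.h>
#include <sys/wait.h>

int main(void) {
    /* Create an eventfd object: a kernel-maintained 64-bit counter. */
    int efd = eventfd(0, 0);
    if (efd < 0) { perror("eventfd"); return 1; }

    if (fork() == 0) {
        /* Child: signal the parent by adding 1 to the counter. */
        uint64_t val = 1;
        write(efd, &val, sizeof(val));
        _exit(0);
    }

    /* Parent: blocks until the counter becomes non-zero, then reads
     * (and resets) it. */
    uint64_t val;
    read(efd, &val, sizeof(val));
    printf("notified, counter value = %llu\n", (unsigned long long)val);

    wait(NULL);
    close(efd);
    return 0;
}
----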
Network device Architecture and concepts
Control and Data paths
Two different flows are used by a network application using a NIC device:
-
Control: manages configuration changes (activation/deactivation) and capability negotiation (speed, duplex, buffer size) between the NIC and network application for establishing and terminating the data path on which data packets will be transferred.
-
Data: performs data packet transfer between the NIC and the network application. Packets are transferred from the NIC internal buffer to a host memory area that is reachable by the network application.
Each flow uses a well-defined path:
-
control path
-
data path
Event versus polling based packet processing
The Linux network stack uses an event-based packet processing method. With this method, every incoming packet hitting the NIC:
-
is copied into host memory via DMA
-
then the NIC generates an interrupt
-
then a kernel module places the packet into a "socket buffer"
-
finally, the application reads it via a "read" system call
For every egress packet generated by the network application:
-
the application performs a write call on the socket in order to copy the generated packet from the application's user space to a socket buffer
-
the kernel device driver invokes the NIC DMA engine to transmit the frame onto the wire
-
once transmission is complete, the NIC raises an interrupt to signal transmit completion so that the socket buffer memory can be freed.
This method is not efficient when packets hit the NIC at a high rate: many interrupts are generated, creating a lot of context switching (kernel to user space and vice versa).
image::../diagrams/extracted-media-chapter2cleaned4adoc.docx/media/image3.png[image]
Event-based packet processing
image::../diagrams/extracted-media-chapter2cleaned4adoc.docx/media/image4.png[image]
Polling-based packet processing
Polling-based packet processing is an alternative method (it is the one used by DPDK). All incoming packets are copied transparently (without generating any interrupt) by the NIC into a specific host memory region predefined by the application. At a regular pace, the network application reads (polls) the packets stored in this memory area.
In the opposite direction, the network application writes packets into the shared memory region, and a DMA transfer is triggered to copy each packet from host memory to the NIC buffers.
No interrupt is used with this method, but it requires the network application to check at a regular pace whether a new packet has hit the NIC. This method is well suited to high-rate packet processing; if packets arrive at a slow rate, it is less efficient than the event-based method.
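The heart of this approach can be sketched as a DPDK-style busy-polling loop in C, shown below. Port and queue initialization (rte_eth_dev_configure, rte_eth_rx_queue_setup, rte_eth_dev_start) is omitted for brevity; the point is that packets are fetched in bursts with no interrupts involved:

[source,c]
----
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

int main(int argc, char **argv) {
    /* Initialize the DPDK Environment Abstraction Layer. */
    if (rte_eal_init(argc, argv) < 0)
        return -1;

    uint16_t port_id = 0;  /* assume port 0 was configured and started */
    struct rte_mbuf *pkts[BURST_SIZE];

    for (;;) {
        /* Poll the NIC receive queue: returns 0..BURST_SIZE packets,
         * without any interrupt being raised. */
        uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, pkts, BURST_SIZE);
        for (uint16_t i = 0; i < nb_rx; i++) {
            /* ... packet processing would happen here ... */
            rte_pktmbuf_free(pkts[i]);  /* return the buffer to its pool */
        }
    }
    return 0;
}
----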
Network devices virtualization
Like CPU virtualization, two kinds of methods are used to virtualize network devices:
-
Software-Based Emulation.
-
Hardware-assisted Emulation.
Software-based emulation is widely supported but can suffer from poor performance. Hardware-assisted emulation provides good performance thanks to hardware acceleration, but it requires hardware that supports specific features.
Software-Based Emulation.
Two solutions are proposed for device virtualization with software:
-
Traditional Device Emulation (Binary Translation): the guest device drivers are not aware of the virtualization environment. During runtime, the Virtual Machine Manager (VMM), usually QEMU/KVM, will trap all the IO and Memory-mapped I/O (MMIO) accesses and emulate the device behavior (trap and emulate mechanism).
The Virtual Machine Manager (VMM) emulates the I/O device to ensure compatibility and then processes I/O operations before passing them on to the physical device (which may be different). Many VMEXITs (context switches) are generated with this method, which results in poor performance.
-
Paravirtualized Device Emulation (virtio): the guest device drivers are aware of the virtualization environment. This solution uses a front-end driver in the guest that works in concert with a back-end driver in the Virtual Machine Manager (VMM). These drivers are optimized for sharing and have the benefit of not needing to emulate an entire device. The back-end driver communicates with the physical device. Performance is much better than with Traditional Device Emulation.
Software-emulated devices can be completely virtual, with no physical counterpart, or can be backed by a physical device exposing a compatible interface.
Hardware-assisted Emulation.
Two solutions are proposed for device virtualization assisted with hardware:
-
Direct Assignment: allows a VM to directly access a network device. The guest device drivers can directly access the device configuration space to, e.g., launch a DMA operation in a safe manner, via the IOMMU.
Drawbacks:
-
direct assignment has limited scalability. A physical device can only be assigned to one single VM.
-
IOMMU must be supported by the host CPU (Intel VT-d or AMD-Vi feature).
-
SR-IOV: with SR-IOV, each physical device (physical function) can appear as multiple virtual ones (aka virtual functions). Each virtual function can be directly assigned to one VM, and this direct assignment uses the VT-d/IOMMU feature.
-
Drawbacks:
-
IOMMU must be supported by the host CPU (Intel VT-d or AMD-Vi feature).
-
SR-IOV must be supported by the NIC device (but also by the BIOS, the host OS and the guest VM).
Emulated network devices
The following two emulated network devices are provided with QEMU/KVM:
-
e1000 device: emulates an Intel E1000 network adapter (Intel 82540EM, 82573L, 82544GC).
-
rtl8139 device: emulates a Realtek 8139 network adapter.
Paravirtualized network device
Virtio is an open specification for virtual machines' data I/O communication, offering a straightforward, efficient, standard and extensible mechanism for virtual devices, rather than boutique per-environment or per-OS mechanisms. It leverages the fact that the guest can share memory with the host to implement I/O.
Virtio was developed as a standardized open interface for virtual machines (VMs) to access simplified devices such as block devices and network adaptors.
Virtio frontend and backend
The VirtIO interface is made of a backend component and a frontend component:
-
The frontend component is the guest side of the virtio interface
-
The backend component is the host side of the virtio interface
Virtio transport protocol
The virtio network driver is the VirtIO frontend component, exposed in the guest VM.
The virtio network device is the VirtIO backend component, exposed by the hypervisor.
The virtio network frontend and backend are interconnected with a transport protocol (usually PCI/PCIe).
The virtio drivers must be able to allocate memory regions that both the hypervisor and the devices can access for reading and writing, via memory sharing. Two different domains have to be considered for a network device:
-
virtio device initialization, activation or shutdown (control plane)
-
network packets transfer through the virtio device (data plane)
The control plane is used for capability negotiation between the host and the guest, both for establishing and terminating the data plane. The data plane is used for transferring the actual packets between host and guest.
Virtqueues are the mechanism for bulk data transport on virtio devices. They are composed of:
-
guest-allocated buffers that the host interacts with (read/write packets)
-
descriptor rings
Virtqueues are controlled with I/O register notification messages:
-
Available Buffer Notification: virtio driver notifies there are buffers that are ready to be processed by the device.
-
Used Buffer Notification: virtio device notifies it has finished processing some buffers.
Virtio device network backend
The network backend is the component that interacts with the emulated NIC and is exposed on the host side. Usually the network backend is a tap device, but other backends are available with VirtIO (SLIRP, VDE, socket).
Tap devices are virtual point-to-point network devices that user space applications can use to exchange L2 packets. Tap devices require the tun kernel module to be loaded; this module creates a character device in the /dev/net directory tree (/dev/net/tun).
Each new tap device gets a name in the /dev/net tree.
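For illustration, a tap device can be created manually with the iproute2 tool (the device name tap0 is arbitrary):
$ sudo ip tuntap add dev tap0 mode tap
$ sudo ip link set dev tap0 up
QEMU typically creates and attaches such a device automatically when a tap backend is requested.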
Virtio net backend drawbacks
The usual transport backend used by the virtio net device presents some inefficiencies:
-
a syscall and a data copy are required for each packet sent or received through the tap interface (no bulk transfer mode).
-
the virtio driver (frontend) notifies the virtio device (backend) that a packet is available using an interrupt message (ioctl)
-
each interrupt message stops vCPU execution and generates a context switch (VMEXIT). The host then processes the available packet and resumes the VM execution using a syscall.
Each time a packet is sent, the VM stops working while the available packet is processed.
The hypervisor is involved in both the virtio control plane and the data plane.
vhost protocol
The vhost protocol was designed to address the limitations of the usual virtio transport backend. It is a message-based protocol which allows the hypervisor to offload the data plane to a handler, a component which manages virtio data forwarding. The host hypervisor no longer processes packets.
The dataplane is fully offloaded to the handler, which reads or writes packets to/from the virtqueues. The vhost handler directly accesses the virtqueues memory region, and sends and receives notification messages.
The vhost handler is made up of two parts:
-
vhost-net
-
a kernel driver
-
it exposes a character device on /dev/vhost-net
-
uses ioctls to exchange vhost messages (vhost protocol control plane),
-
uses irqfd and ioeventfd file descriptors to exchange notifications with the guest.
-
spawns a vhost worker thread
-
vhost worker
-
a Linux kernel thread named vhost-<pid> (<pid> is the hypervisor process ID)
-
handles the I/O events (generated by virtio driver or tap device)
-
forwards packets (copy operations)
A tap device is still used to connect the guest instance to the host, but the virtio dataplane is managed by the vhost handler and is no longer processed by the hypervisor.
The guest instance is no longer stopped (context switch with a VMEXIT) at each VirtIO packet transfer.
The new vhost-net packet processing backend is completely transparent to the guest, which still uses the standard virtio interface.
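The vhost worker thread can be observed in the host process list; for a hypervisor process with PID 12345 (an illustrative value), a kernel thread named vhost-12345 appears:
$ ps -ef | grep vhost-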
Physical network device Direct I/O Assignment
KVM guests usually have access to software based emulated NIC device (either para-virtualized devices with virtio or traditional emulated devices). On host machines which have Intel VT-d or AMD IOMMU hardware support, another option is possible. PCI devices may be assigned directly to the guest, allowing the device to be used with minimal performance overhead.
Assigned devices are physical devices that are exposed to the virtual machine. This method is also known as passthrough.
The VT-d or AMD IOMMU extensions must be enabled in the BIOS in order to be able to perform device Direct Assignment.
Two methods are supported:
-
PCI passthrough: PCI devices on the host system are directly attached to virtual machines, providing guests with exclusive access to PCI devices for a range of tasks. This enables PCI devices to appear and behave as if they were physically attached to the guest virtual machine.
-
VFIO device assignment: VFIO improves on previous PCI device assignment architecture by moving device assignment out of the KVM hypervisor and enforcing device isolation at the kernel level.
With VFIO, the physical device is exposed to host user space memory and is made visible to the guest VM it has been assigned to.
SR-IOV
The Single Root I/O Virtualization (SR-IOV) specification is defined by the PCI-SIG (PCI Special Interest Group). It is a PCI Express (PCIe) specification that allows a single physical PCI function to share its PCI resources as separate virtual functions (VFs).
The physical function contains the SR-IOV capability structure and manages the SR-IOV functionality (it can be used to configure and control a PCIe device).
A single physical port (root port) presents multiple, separate virtual devices as unique PCI device functions (up to 256 virtual functions – depends on device capabilities).
Each virtual device may have its own unique PCI configuration space, memory-mapped registers, and individual MSI-based interrupts. Unlike a physical function, a virtual function can only configure its own behavior. Each virtual function can be directly connected to a virtual machine via PCI device assignment (passthrough mode).
SR-IOV improves network device performance for each virtual machine as it can share a single physical device between several virtual machines using device direct I/O assignment method.
With SR-IOV, each VM has direct access to the physical network using its assigned virtual function interface. VMs can communicate with each other using the Virtual Ethernet Bridge provided by the NIC card. A virtual switch can also use SR-IOV to get access to the physical network. VMs using an SR-IOV assigned virtual function device have direct access to the physical network and are not connected to any intermediate virtual network switch or router.
The following command can be used to check whether SR-IOV is supported on a physical NIC card:
$ lspci -s <NIC_BDF> -vvv | grep -i "Single Root I/O Virtualization"
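When SR-IOV is supported, virtual functions can typically be instantiated through sysfs (the interface name ens3f0 and the VF count below are illustrative):
$ cat /sys/class/net/ens3f0/device/sriov_totalvfs
$ echo 4 | sudo tee /sys/class/net/ens3f0/device/sriov_numvfs
Each created VF then appears as a new PCI device that can be assigned to a VM.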
VirtIO SR-IOV and SDN
VirtIO brings a lot of flexibility: it offers a standardized driver which is fully independent of the hardware used on the physical platform hosting the VM instances.
When virtio connectivity is used, a VM can easily be migrated from one host to another using the "live migration" feature. When SR-IOV is used, live migration is not an easy task and is in practice not achievable.
Indeed, with SR-IOV the network driver used by a VM depends on the hardware of the bare metal node hosting it. To migrate a VM from one bare metal node to another, both nodes must at least use the same NIC hardware model. On the other hand, when SR-IOV is used, VM connectivity has nearly the same performance as a real physical NIC, whereas with VirtIO performance can be poor.
Also, since SR-IOV provides direct access to the physical NIC, the host virtual network functions (virtual router/switch) used by the SDN solution are totally blind to VMs using such connectivity. Local traffic switching between VMs connected to the same SR-IOV physical card is achieved by the Virtual Ethernet Bridge provided by SR-IOV. Communication between VMs connected to distinct SR-IOV physical ports must rely on the physical network.
SDN vswitch/vrouter usage is therefore very limited when SR-IOV is used: packet switching between VMs which use VFs of the same SR-IOV physical port is performed by the Virtual Ethernet Bridge hosted in the physical NIC.
Only a few use cases are relevant:
-
Providing internal connectivity between VMs using distinct SR-IOV physical ports (this avoids sending the traffic out of the server to be processed by the physical network)
-
Building hybrid solutions with multi-NIC VMs: network traffic not requiring high performance uses an emulated NIC (management traffic for instance), while network connectivity requiring high performance is processed by an SR-IOV assigned NIC (video data traffic for instance).
With SR-IOV we get high performance but poor flexibility and no network virtualization features. With VirtIO we get a high level of network virtualization suitable for SDN, very flexible but with poor performance.
For SDN use cases, we need network virtualization features and performance. DPDK will bring both.
Network packet processing performance requirements
Ethernet minimum frame size is 64 Bytes. When Ethernet frames are sent onto the wire, Inter Frame Gap and Preamble bits are added. Minimum size of Ethernet frames on the physical layer is 84 Bytes (672 bits).
For a 10 Gbit/s interface, the number of frames per second can reach up to 14.88 Mpps for traffic using the smallest Ethernet frame size. It means a new frame has to be forwarded every 67 ns.
A CPU running at 2 GHz has a 0.5 ns cycle time. Such a CPU has a budget of only about 134 cycles per packet to be able to process a 10 Gb/s flow.
Generic Linux Ethernet drivers are not performant enough to process such a 10 Gb/s packet flow. Indeed, with regular Linux NIC drivers, a lot of time is required to:
-
perform packet processing in Linux Kernel using interrupt mechanism,
-
transfer application data from host memory to Network Interface card
DPDK is one of the most widely used solutions for building network applications that use high-speed NICs and work at wire speed. Therefore, Contrail proposes DPDK as one of the solutions for physical compute connectivity.
DPDK and Network applications
DPDK application working principle
DPDK dedicates one (or more) CPU to one (or more) threads that continuously poll one (or more) DPDK NIC RX queues. A CPU on which a DPDK polling thread is started will be loaded at 100% whether there are packets to process or not, as no interrupt mechanism is used in DPDK to warn the DPDK application that a packet has been received.
Using the DPDK library API, physical NIC packets are made available in the user space memory in which the DPDK application is running. So, when DPDK is used, there is no user space to kernel space context switching, which saves lots of CPU cycles. Also, host memory uses large contiguous memory areas, the huge pages, which allow large data transfers and avoid heavy memory fragmentation; such fragmentation would require a higher memory management effort at the application level and would also cost precious CPU cycles.
Hence, most of the CPU cycles of a DPDK pinned CPU are used for polling and processing packets delivered by the physical NIC in DPDK queues. As a result, the packet forwarding task can be processed at a very high speed. If one CPU is not powerful enough to manage incoming packets hitting the physical NIC at a very high rate, we can allocate an additional one to the DPDK application in order to increase its packet processing capacity.
A DPDK application is a multi-threaded program that uses the DPDK library to process network data. In order to scale, several packet polling and processing threads (each one pinned to a dedicated CPU) can be run in parallel.
Three main components are involved in a DPDK application:
-
Physical NIC
-
buffering packets in physical queues
-
using DMA to transfer packets in host memory
-
DPDK NIC abstraction with its queue representation in huge pages host memory:
-
descriptor rings
-
mbuf (to store packets)
-
Linux pthreads used to poll and process packets received in the DPDK NIC queues.
DPDK overview
Data Plane Development Kit (DPDK) is a set of data plane libraries and network interface controller drivers for fast packet processing, currently managed as an open-source project under the Linux Foundation.
The main goal of the DPDK is to provide a simple, complete framework for fast packet processing in data plane applications.
The framework creates a set of libraries for specific environments through the creation of an Environment Abstraction Layer (EAL), which may be specific to a mode of the Intel architecture (32-bit or 64-bit), Linux user space compilers or a specific platform.
These environments are created through the use of make files and configuration files. Once the EAL library is created, the user may link with the library to create their own applications.
The DPDK implements a "run to completion model" for packet processing, where all resources must be allocated prior to calling Data Plane applications, running as execution units on logical processing cores.
The model does not support a scheduler and all devices are accessed by polling. The primary reason for not using interrupts is the performance overhead imposed by interrupt processing.
For more information, please refer to the dpdk.org documentation: http://dpdk.org/doc/guides/prog_guide/index.html
DPDK software architecture
DPDK is a set of programming libraries that can be used to create applications that need to process network packets at high speed. DPDK provides the following functions:
-
A queue manager implements lockless queues
-
A buffer manager pre-allocates fixed size buffers
-
A memory manager allocates pools of objects in memory and uses a ring to store free objects
-
Poll mode drivers (PMD) are designed to work without asynchronous notifications, reducing overhead
-
A packet framework made up of a set of libraries that are helpers to develop packet processing
In order to reduce Linux user space to kernel space context switching, all these functions are made available by DPDK in the user space where applications are running. User applications using DPDK libraries have direct access to the NIC cards, without passing through a NIC kernel driver as is required when DPDK is not used.
Regular Network Application
Network Application with DPDK
DPDK allows building user space multi-threaded network applications using the POSIX thread (pthread) library.
DPDK is a framework which is made of several libraries:
-
Environment Abstraction Layer (EAL)
-
Ethernet Devices Abstraction (ethdev)
-
Queue Management (rte_ring)
-
Memory Pool Management (rte_mempool)
-
Buffer Management (rte_mbuf)
-
Timer Manager (librte_timer)
-
Ethernet Poll Mode Driver (PMD)
-
Packet Forwarding Algorithm made up of Hash (librte_hash) and Longest Prefix Match (LPM,librte_lpm) libraries
-
IP protocol functions (librte_net)
The ethdev library exposes APIs to use the networking functions of DPDK NIC devices. The bottom half of ethdev is implemented by the NIC PMD drivers; thus, some features may not be implemented by a given driver.
Poll Mode Ethernet Drivers (PMDs) are a key component of DPDK. PMDs bypass the kernel and provide direct access to the Network Interface Cards (NICs) used with DPDK.
Linux user space device enablers (UIO or VFIO) are provided by the Linux kernel and are required to run DPDK.
They allow discovering and exposing PCI device information and address space through the /sys directory tree.
DPDK libraries allow kernel-bypass application development, providing:
-
probing for PCI devices (attached via a Linux user space device enabler),
-
huge-page memory allocation,
-
data structures geared toward polled-mode message-passing applications:
-
such as lockless rings
-
memory buffer pools with per-core caches.
The diagram below provides an overview of DPDK libraries.
Only a few libraries are described in this diagram; the set of libraries is enriched at each new DPDK release (cf. https://www.dpdk.org/).
DPDK Environment Abstraction Layer
The Environment Abstraction Layer (EAL) is responsible for providing access to low-level resources such as hardware and memory space. It provides a generic interface that hides the environment specifics from applications and libraries. The EAL performs physical memory allocation using mmap() in hugetlbfs (using huge page sizes to increase performance).
The services provided by the EAL are:
-
DPDK loading and launching
-
Support for multi-process and multi-thread execution types
-
Core affinity/assignment procedures
-
System memory allocation/de-allocation
-
Atomic/lock operations
-
Time reference
-
PCI bus access
-
Trace and debug functions
-
CPU feature identification
-
Interrupt handling
-
Alarm operations
-
Memory management (malloc)
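As an illustration, these EAL resources are requested on the application command line. The sketch below uses the testpmd test application shipped with DPDK (binary name and options can vary between DPDK releases):
$ ./dpdk-testpmd -l 0-3 -n 4 --socket-mem 1024,1024 -- -i
Here -l selects the lcores to use, -n gives the number of memory channels, and --socket-mem reserves hugepage memory per NUMA socket; options after the "--" separator belong to the application itself.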
DPDK memory management
DPDK optimized memory management for speed
DPDK has a highly optimized memory manager. DPDK works on groups of fixed-size objects called mempools, all of which are pre-allocated. DPDK discourages dynamic allocation because it consumes a lot of CPU cycles and is a speed killer.
DPDK stores incoming packets into mbufs (memory buffers). DPDK pre-allocates a set of mbufs and keeps it in a pool called mempool.
DPDK makes use of mempools each time it needs to allocate an mbuf in which packets are stored. Instead of allocating a single mbuf, DPDK does a bulk allocation, and a bulk free once packets are consumed. By doing this, the packets to be processed (mbufs) are already in cache memory; DPDK is therefore very cache friendly.
The mempool has further optimizations: everything is cache-aligned, and some mbufs are allocated per DPDK thread or lcore. Each mempool is also bound to rings which reference the mbufs containing the packets stored in the mempool.
Each ring is a highly optimized lockless ring. It can be used by several lcores in a multi-producer/multi-consumer scenario without locks. By avoiding locks, DPDK gets large performance gains, as data structure locking is also a speed killer.
mbufs and mempools
Network data is stored in the compute node's central memory (in the huge page area).
DPDK uses message buffers known as mbufs to store packet data into the host memory.
These mbufs are stored in memory pools known as mempools.
mbufs store the DPDK NIC incoming and outgoing packets which have to be processed by the DPDK application.
Packet descriptors
DPDK queues do not store the packets themselves but pointers to the real packets.
This avoids the data transfer that would otherwise be needed when packets have to be forwarded from one DPDK NIC to another.
Packets are not moved from one queue to another; it is the descriptors (pointers) that move between queues.
DPDK rings
Descriptors are set up as a ring. A ring is a circular array of descriptors. Each ring describes a single direction DPDK NIC queue.
Each DPDK NIC queue is made up of 2 rings (1 per direction: 1 RX ring, 1 TX ring).
Each descriptor points onto a packet that has been received (RX ring) or that is going to be transmitted (TX ring).
The more descriptors the RX/TX rings contain, the more memory (number of mbufs) is required in each mempool to store data.
Data Transfer between host NIC and memory
A DPDK application only processes packets that are exposed in user space host OS memory.
DPDK rings are an abstraction of the real NIC queues: DPDK uses DMA to keep the NIC hardware queues and their DPDK representation in host memory synchronized at all times.
Physical NIC incoming packets
When an incoming packet reaches the physical NIC interface, it is stored in the NIC physical queue memory. The RX ring manages packets that have to be processed by the DPDK application.
Synchronization between the host OS and the NIC happens through two registers, whose content is interpreted as an index in the RX ring:
-
Receive Descriptor Head (RDH): indicates the first descriptor prepared by the OS that can be used by the NIC to store the next incoming packet.
-
Receive Descriptor Tail (RDT): indicates the position to stop reception, i.e. the first descriptor that is not ready to be used by the NIC.
The DMA transfer transparently copies packets from the physical NIC memory to the host central memory. DMA uses the RDT descriptor as the destination memory address for the data to be transferred.
Once packets have been transferred into host memory, both the RX ring and RDT are updated.
Physical NIC outgoing packets
When a packet has to be sent from host memory to the physical NIC interface, it is referenced in the NIC TX ring by the DPDK application. The TX ring manages packets that have to be transferred to the NIC card.
Synchronization between the host OS and the NIC happens through two registers, whose content is interpreted as an index in the TX ring:
-
Transmit Descriptor Head (TDH): indicates the first descriptor that has been prepared by the OS and has to be transmitted on the wire.
-
Transmit Descriptor Tail (TDT): indicates the position to stop transmission, i.e. the first descriptor that is not ready to be transmitted, and that will be the next to be prepared.
DPDK and packet processing
Linux pthreads
Multithreading is the ability of a CPU (a single core in a multi-core processor architecture) to provide multiple concurrent threads of execution. In a multithreaded application, the threads share some CPU memory resources:
-
CPU caches
-
translation lookaside buffer (TLB)
A single Linux process can contain multiple threads, all of which are executing the same program. These threads share the same global memory (data and heap segments), but each thread has its own stack (local variables).
Linux pthreads (POSIX threads) is a C library which contains a set of functions for managing threads within an application. DPDK uses the Linux pthreads library.
DPDK lcores
DPDK uses threads that are designated as "lcores". An "lcore" refers to an EAL thread, which is really a Linux pthread running on a single processor execution unit.
-
the first lcore, which executes the main() function and launches the other lcores, is named the master lcore.
-
any lcore that is not the master lcore is a slave lcore.
Lcores do not share CPU units. Nevertheless, if the host processor supports hyper-threading, a physical core may host several lcores (one per hardware thread).
lcores are used to run DPDK application packet processing threads. Several packet processing models are proposed by DPDK. The simplest one is the Run-To-Completion model.
Run-to-Completion uses a single thread (lcore) for end-to-end packet processing (packet polling, processing and forwarding).
Multicore Scaling - Pipeline model
A complex application is typically split across multiple cores, with cores communicating through Software queues.
The Packet Framework facilitates the creation of pipelines. Each pipeline thread is assigned to a CPU and uses software queues as input and/or output ports.
For instance, the Contrail DPDK vRouter uses such a model for GRE encapsulated packet processing.
Control Threads
It is possible to create Control Threads. Those threads can be used for management/infrastructure tasks and are used internally by DPDK for multi process support and interrupt handling.
Service Core
DPDK service cores enable a dynamic way of performing work on DPDK lcores. Service core support is built into the EAL, and an API is provided to optionally allow applications to control how the service cores are used at runtime.
DPDK and Poll Mode Drivers (PMD)
When DPDK is used, network interfaces are no longer managed in kernel space. The regular Linux NIC driver usually used to manage the NIC has to be replaced by a new driver able to run in user space. This new driver, called a Poll Mode Driver (PMD), manages the network interface in user space with the DPDK library.
Physical NIC and BAR registers
PCI devices have a set of registers referred to as configuration space for devices. These configuration space registers are mapped to host memory locations.
When a PCI device is enabled, the system's device drivers (by writing configuration commands to the PCI controller) program the Base Address Registers (BARs) to inform the PCI device of its address mapping. The host operating system is then able to address this PCI device.
Linux NIC drivers
With the usual Linux NIC kernel drivers, both NIC configuration and packet processing are done in kernel space. User applications which have to establish a TCP connection or send a UDP packet use the sockets API exposed by the libc library.
NIC configuration
NIC packet processing
Linux packet processing with the sockets API requires the following costly operations:
-
Kernel Linux System calls
-
Multitask context switching on blocking I/O
-
Data copying from kernel (ring buffers) to user space
-
Interrupt handling in kernel
With usual Linux drivers, most operations occur in kernel mode and require lots of user space to kernel space context switches and interrupt mechanisms. The heavy context switching costs lots of CPU cycles and limits the number of packets that a CPU is able to process. Such drivers are not able to perform packet processing at the expected high speed, especially when 10/40/100G Ethernet generation cards are used on a Linux system.
Poll Mode Drivers
A Poll Mode Driver consists of APIs, running in user space, to configure the devices and their respective queues. In addition, a PMD accesses the RX and TX descriptors directly without any interrupts (with the exception of Link Status Change interrupts) to quickly receive, process and deliver packets in the user’s application.
Poll Mode Drivers are involved in NIC configuration: they expose the NIC configuration registers in a host memory area which is directly reachable from user space.
NIC configuration
NIC packet processing
In short, Poll Mode Drivers are user space pthreads which:
-
call specific EAL functions
-
have a per NIC implementation
-
have direct access to RX/TX descriptors
-
use Linux user space device enablers (UIO or VFIO) driver for specific control changes (interrupts configuration)
Hence user applications can directly configure the NIC cards they use from the Linux user space where they run.
A first configuration phase uses Poll Mode Drivers and the DPDK library to configure DPDK ring buffers in Linux user space. Next, incoming packets are automatically transferred with the DMA (Direct Memory Access) mechanism from the NIC physical RX queues in NIC memory to the DPDK RX ring buffers in host memory. DMA is also used to transfer outgoing packets from the DPDK TX ring buffers in host memory to the NIC physical TX queues in NIC memory. DMA offloads expensive memory operations, such as large copies or scatter-gather operations, from the CPU.
Direct Memory Access (DMA)
Direct Memory Access (DMA) allows PCI devices to read (write) data from (to) memory without CPU intervention. This is a fundamental requirement for high performance devices.
DMA is a mechanism that uses a specific hardware controller to manage read and write operations in the main system memory (RAM: Random Access Memory). This mechanism is totally independent of the central processing unit (CPU) and does not consume any CPU resource. A DMA transfer is triggered by the CPU and works in the background using the dedicated hardware resource (the DMA controller).
DPDK rings and NIC buffers are synchronized with DMA. Thanks to this synchronization mechanism, the DPDK application can transparently access NIC packets in user space by reading or writing data in the DPDK rings.
IOMMU
Input–Output Memory Management Unit (IOMMU) is a memory management unit (MMU) that connects a Direct Memory Access (DMA) capable I/O bus to the main memory.
In virtualization, the IOMMU remaps the addresses accessed by the hardware using a translation table similar to the one used to map guest virtual machine memory addresses to host physical memory addresses.
The IOMMU gives a device access only to a well-scoped physical memory area corresponding to a given guest virtual machine's memory. The IOMMU thus helps prevent DMA attacks that could originate from malicious devices, and provides DMA and interrupt remapping facilities to ensure I/O devices behave within the boundaries they have been allotted.
Intel has published a specification for IOMMU technology as Virtualization Technology for Directed I/O, abbreviated as VT-d.
In order to get IOMMU enabled:
-
both kernel and BIOS must support and be configured to use IO virtualization (such as Intel® VT-d).
-
IOMMU must be enabled in the Linux kernel parameters in /etc/default/grub, followed by a run of the update-grub command.
GRUB configuration example with IOMMU Passthrough enabled:
GRUB_CMDLINE_LINUX_DEFAULT="iommu=pt intel_iommu=on"
DPDK supported NICs
DPDK Library includes Poll Mode Drivers (PMDs) for physical and emulated Ethernet controllers which are designed to work without asynchronous, interrupt-based signaling mechanisms.
-
Available DPDK PMD for physical NIC:
-
I40e PMD for Intel X710/XL710/X722 10/40 Gbps family of adapters http://dpdk.org/doc/guides/nics/i40e.html
-
Linux bonding PMD http://dpdk.org/doc/guides/prog_guide/link_bonding_poll_mode_drv_lib.html
-
Available DPDK PMD for Emulated NIC:
-
DPDK EM poll mode driver supports emulated Intel 82540EM Gigabit Ethernet Controller (qemu e1000 device):
http://doc.dpdk.org/guides/nics/e1000em.html
-
Virtio Poll Mode driver for emulated VirtIO NIC
http://dpdk.org/doc/guides/nics/virtio.html
-
VMXNET3 NIC when VMWare hypervisors are used:
http://doc.dpdk.org/guides/nics/vmxnet3.html
-
Lots of other NICs are supported by DPDK (cf. http://doc.dpdk.org/guides/nics/overview.html).
Different PMDs may require different kernel drivers in order to work properly (cf Linux User space device enablers). Depending on the PMD being used, a corresponding kernel driver should be loaded and bound to the network ports.
It is also preferable that each NIC be flashed with the latest version of its NVM/firmware.
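The driver and firmware version currently used by a port can be checked with ethtool, for instance (the interface name is illustrative):
$ ethtool -i ens3f0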
Linux user space device enablers
Most PMDs use generic user space device enablers to expose the physical NIC registers in user space host memory. Two user space device enablers are widely used by DPDK PMDs: UIO and VFIO.
UIO - User Space IO
Linux kernel version 2.6 introduced the User Space IO (UIO) loadable module. UIO is a kernel-bypass mechanism which provides an API that enables user space handling of legacy interrupts (INTx).
UIO has some limitations:
-
UIO does not manage message-signaled interrupts (MSI or MSI-X).
-
UIO also does not support DMA isolation through IOMMU.
UIO only supports legacy interrupts so it is not usable with SR-IOV and virtual hosts which require MSI/MSI-X interrupts.
Despite these limitations, UIO is well suited for use in virtual machines, where direct IOMMU access is not available. In such a situation, a guest instance user space process is not isolated from the other processes in the same instance, but the hypervisor can still isolate any guest instance from the other instances and from hypervisor host processes using the IOMMU.
Currently, two UIO modules are supported by DPDK:
-
Linux Generic (uio_pci_generic), which is the standard proposed UIO module included in the Linux kernel.
-
DPDK specific (igb_uio) which must be compiled with the same kernel as the one running on the target.
The DPDK-specific UIO kernel module is loaded with the insmod command after the UIO module has been loaded:
$ sudo modprobe uio
$ sudo insmod kmod/igb_uio.ko
While a single command is needed to load Linux Generic UIO module:
$ sudo modprobe uio_pci_generic
The DPDK-specific UIO module may be preferred over the Linux generic UIO module in some situations (cf. https://doc.dpdk.org/guides/linux_gsg/linux_drivers.html).
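In both cases, the dpdk-devbind.py script shipped with DPDK can be used to inspect and change the driver bound to a port (the PCI address below is illustrative):
$ dpdk-devbind.py --status
$ sudo dpdk-devbind.py --bind=uio_pci_generic 0000:03:00.0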
VFIO – Virtual Function I/O
Virtual Function I/O (VFIO) kernel infrastructure was introduced in Linux version 3.6.
VFIO provides a user space driver development framework allowing user space applications to interact directly with hardware devices by mapping the I/O space directly to the application’s memory.
VFIO is a framework for building user space drivers that provides:
-
Mapping of device’s configuration and I/O memory regions to user memory
-
DMA and interrupt remapping and isolation based on IOMMU groups.
-
Eventfd and irqfd based signaling mechanism to support events and interrupts from and to the user space application.
VFIO exposes APIs which allow to:
-
create character devices (in /dev/vfio/)
-
support ioctl calls
-
support mechanisms for describing and registering interrupt notification.
The VFIO driver is an IOMMU/device-agnostic framework for exposing direct device access to user space in a secure, IOMMU-protected environment. For bare-metal environments, VFIO is the preferred framework for Linux kernel bypass. It operates with the Linux kernel's IOMMU subsystem, which is used to place devices into IOMMU groups. User space processes can open these IOMMU groups and register memory with the IOMMU for DMA access using VFIO ioctl calls. VFIO also provides the ability to allocate and manage message-signaled interrupt vectors.
A single command is needed to load VFIO module:
$ sudo modprobe vfio_pci
Although VFIO was created to work with an IOMMU, VFIO can also be used without one (this is just as unsafe as using UIO).
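This "no-IOMMU" mode must be enabled explicitly when loading the module (the parameter is available since kernel 4.5):
$ sudo modprobe vfio enable_unsafe_noiommu_mode=1
$ sudo modprobe vfio-pci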
Linux user space device enablers to be used
VFIO is generally the preferred Linux user space device enabler to be used because it supports IOMMU to protect host memory. When a real hardware PCI device is attached to host system and IOMMU is used with VFIO, all the reads/writes of that device done in user space by the DPDK application will be protected by the host IOMMU.
But there are a few exceptions. Below is Intel's recommendation for the choice of the kernel driver to be used with DPDK:
DPDK and Host Hardware architecture
NUMA
NUMA stands for Non-Uniform Memory Access.
A traditional server has a single CPU, a single RAM and a single RAM controller.
A RAM can be made of several DIMM banks in several sockets, all being associated to the CPU. When the CPU needs access to data in RAM, it requests it to its RAM controller.
Recent servers can have multiple CPUs, each one having its own RAM and its own RAM controller. Such systems are called NUMA systems, or Non-Uniform Memory Access. For example, in a server with 2 CPUs, each one can be a separate NUMA: NUMA0 and NUMA1.
NUMA nodes architecture.
-
In green: CPU core accessing a memory item located in its own NUMA’s RAM controller, showing minimum latency.
-
In red: CPU core accessing a memory item located in the other NUMA through the QPI (Quick Path Interconnect) path and the remote RAM controller, showing a higher latency.
When CPU0 needs to access data located in RAM0, it will go through its local RAM controller 0. Same thing happens for CPU1.
When CPU0 needs to access data located in the other RAM1, the first (local) controller 0 has to go through the second (remote) RAM controller 1, which will access the (remote) data in RAM1. Data uses an internal connection between the two CPUs called QPI, or Quick Path Interconnect, which typically has a high enough capacity to avoid being a bottleneck (one or two links of 25 GBps each, i.e. up to 400 Gbps). For example, the Intel Xeon E5 has 2 CPUs with 2 QPI links between them; the Intel Xeon E7 has 4 CPUs, with a single QPI link between pairs of CPUs.
The fastest RAM that the CPU has access to is the register, which is inside the CPU and reserved to it.
Beyond the register, the CPU has access to cached memory, which is a special memory based on higher performance hardware.
Cached memories are shared between the cores of a single CPU. Typical characteristics of memory cache are:
-
Accessing a Level 1 cache takes 7 CPU cycles (with a size of 64KB or 128KB).
-
Accessing a Level 2 cache takes 11 CPU cycles (with a size of 1MB).
-
Accessing a Level 3 cache takes 30 CPU cycles (with a larger size).
If the CPU needs to access data that is in the main RAM, it has to use its RAM controller.
Access to RAM takes typically 170 CPU cycles (the green line in the diagram). Access to the remote RAM through the remote RAM controller typically adds 200 cycles (the red line in the diagram), meaning RAM latency is roughly doubled.
When data needed by the CPU is located both in the local and in the remote RAM with no particular structure, latency to access data can be unpredictable and unstable.
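The NUMA topology of a host can be inspected, for instance, with the following commands (numactl may need to be installed):
$ lscpu | grep -i numa
$ numactl --hardware
numactl --hardware also reports the relative access "distances" between NUMA nodes.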
Hyper-threading (HT)
A single physical CPU core with hyper-threading appears as two logical CPUs to an operating system.
While the operating system sees two CPUs for each core, the actual CPU hardware only has a single set of execution resources for each core.
Hyper-threading allows the two logical CPU cores to share physical execution resources.
The sharing of resources allows two logical processors to work with each other more efficiently and allows a logical processor to borrow resources from a stalled logical core (assuming both logical cores are associated with the same physical core). Hyper-threading can help speed processing up, but it’s nowhere near as good as having actual additional cores.
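Which logical CPUs share the same physical core can be checked, for example, with lscpu or through sysfs:
$ lscpu -e=CPU,CORE,SOCKET
$ cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
Logical CPUs listed together in thread_siblings_list are hyper-threads of the same physical core.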
Huge pages
Memory is managed in blocks known as pages. On most systems, a page is 4KB. 1MB of memory is equal to 256 pages; 1GB of memory is 262,144 pages, etc. CPUs have a built-in memory management unit that manages a list of these pages in hardware.
The Translation Lookaside Buffer (TLB) is a small hardware cache of virtual-to-physical page mappings.
If the virtual address passed in a hardware instruction can be found in the TLB, the mapping can be determined quickly.
If not, a TLB miss occurs, and the system falls back to slower, software-based address translation.
This results in performance issues.
Since the size of the TLB is fixed, the only way to reduce the chance of a TLB miss is to increase the page size.
Virtual memory address lookup slows down when the number of entries increases.
A huge page is a memory page that is larger than 4KB. On the x86_64 architecture, in addition to the standard 4KB memory page size, two larger page sizes are available: 2MB and 1GB.
The Contrail DPDK vRouter can use both huge page sizes, or only one of them.
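As an illustration, 2MB huge pages can be reserved at runtime through sysfs, while 1GB huge pages generally have to be reserved on the kernel command line at boot (the page counts below are arbitrary examples):
$ echo 1024 | sudo tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
GRUB_CMDLINE_LINUX_DEFAULT="default_hugepagesz=1G hugepagesz=1G hugepages=16"
A hugetlbfs mount point (e.g. /dev/hugepages) must also be available to the application.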
CPU isolation and pinning
An operating system uses a scheduler to place each process and/or thread it has to run onto one of the CPUs offered by the host.
There are two kinds of scheduling, cooperative and preemptive. By default, the Linux scheduler uses a cooperative mode.
In order to reserve a CPU for a subset of tasks, we have to inform the operating system scheduler not to use this CPU for all the tasks it has to run.
Such CPUs are called "isolated" because they are no longer used by the OS to process all tasks. Several mechanisms can be used to isolate a CPU:
-
remove this CPU from the "common" CPU list used to process all tasks
-
change the scheduling algorithm (cooperative to preemptive)
-
make the CPU participate or not in interrupt processing
Isolation and pinning are two complementary mechanisms proposed by the Linux OS:
-
CPU isolation restricts the set of CPUs that are available to the operating system scheduler. When a CPU is isolated, no task will be scheduled on it by the operating system unless an explicit task assignment is done.
-
CPU pinning is also called processor affinity. It enables the binding and unbinding of a process or a thread to a CPU.
CPU pinning consists in defining a limited set of CPUs that are allowed to be used by:
-
the OS Scheduler. Operating System CPU affinity is managed through systemd.
-
a specific process: using CPU pinning rules (taskset command for instance)
Tasks to be run by an operating system must be spread across the available CPUs. In a multi-threading environment, these tasks are often made of several processes, which are themselves made of several threads.
CPU isolation mechanisms
isolcpus
isolcpus is a kernel scheduler option. When a CPU is specified in the isolcpus list, it is removed from the general kernel SMP balancing and scheduler algorithms. The only way to move a process onto or off an "isolated" CPU is via the CPU affinity syscalls (or the taskset command).
This isolation mechanism:
-
removes isolated CPUs from the "common" CPU list used to process all tasks
-
changes the scheduling algorithm (cooperative to preemptive)
-
performs CPU isolation at system boot
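For example, to isolate CPUs 2 to 7 (an illustrative CPU list), the option is added to the kernel command line in the boot loader configuration, followed by a configuration update and a reboot:
GRUB_CMDLINE_LINUX_DEFAULT="... isolcpus=2-7"
$ sudo update-grub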
isolcpus suffers from several drawbacks:
-
it requires manual placement of processes on isolated cpus.
-
it is not possible to rearrange the CPU isolation rules after the system startup
-
the only way to change isolated CPU list is by rebooting with a different isolcpus value in the boot loader configuration (GRUB for instance).
-
isolcpus disables the scheduler load balancer for isolated CPUs. This means the kernel will not balance tasks among the isolated CPUs sharing the same affinity mask.
CPU shield
The cgroups subsystem proposes a mechanism to dedicate some CPUs to one or several user processes. It consists in defining a "user shield" group which protects a subset of CPUs from system tasks.
Three cpusets are defined:
-
root: present in all configurations and contains all cpus (unshielded)
-
system: contains cpus used for system tasks - the ones which need to run but aren’t "important" (unshielded)
-
user: contains cpus used for tasks we want to assign a set of CPU for their exclusive use (shielded)
CPU shields are manipulated with the cset shield command.
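For example, CPUs 2 to 7 can be shielded, and a running process moved into the shield (the CPU list and PID are illustrative):
$ sudo cset shield --cpu 2-7 --kthread=on
$ sudo cset shield --shield --pid 1234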
Tuned
Tuned is a system tuning service for Linux. It uses profiles to describe the Linux OS performance tuning configuration.
The cpu-partitioning profile partitions the system CPUs into isolated and housekeeping CPUs. This profile is intended to be used for latency-sensitive workloads.
PS: Tuned is only supported on the Red Hat Linux OS family.
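For example (the isolated CPU list is illustrative), the isolated CPUs are declared in the profile variables file before activating the profile:
# echo "isolated_cores=2-7" >> /etc/tuned/cpu-partitioning-variables.conf
# tuned-adm profile cpu-partitioning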
Linux systemd - System task CPU affinity
A thread’s CPU affinity mask determines the set of CPUs on which it is eligible to run.
Linux systemd is a software suite that provides an array of system components for Linux operating systems. Its primary component is an init system used to bootstrap user space and manage user processes.
CPUAffinity parameter restricts all processes spawned by systemd to the list of cores defined by the affinity mask.
default CPU affinity
When run as a system instance, systemd interprets the configuration file /etc/systemd/system.conf. In this configuration file, the CPUAffinity variable configures the CPU affinity for the service manager as well as the default CPU affinity for all forked-off processes.
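For example, to restrict systemd and all its forked processes to CPUs 0 and 1 (an illustrative CPU list):
# vi /etc/systemd/system.conf
CPUAffinity=0 1
A reboot is the simplest way to apply this system-wide setting.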
Per service specific CPU affinity
Individual services may override the CPU affinity for their processes with the CPUAffinity setting in their unit files.
# vi /etc/systemd/system/<my service>.service
...
[Service]
CPUAffinity=<CPU mask>
If a specific CPUAffinity has been defined for a given service, the service has to be restarted for the new configuration to be taken into account.
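A typical sequence after editing the unit file is:
# systemctl daemon-reload
# systemctl restart <my service>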
CPU assignment for user processes (taskset)
taskset is used to set or retrieve the CPU affinity of a running process given its PID or to launch a new COMMAND with a given CPU affinity.
We can retrieve the CPU affinity of an existing task:
# taskset -p pid
Or set it:
# taskset -p mask pid
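For example, to launch a program on CPUs 2 and 3, or to move an already running PID there (values are illustrative; -c accepts a CPU list instead of a hexadecimal mask):
# taskset -c 2,3 ./my_app
# taskset -cp 2,3 1234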
Bind a virtual NIC to DPDK
DPDK requires direct NIC access from user space. The VirtIO vhost-user backend exposes the virtio network device in user space.
vhost-user is a library that implements the vhost protocol in user space; it allows exposing a VirtIO backend interface in user space.
The vhost-user library defines the structure of the messages that are sent over a Unix socket to communicate with the VirtIO net device backend (the vhost-net kernel driver uses ioctls instead).
Kernel Mode Virtual Machine connected to a DPDK compute application
The user application uses both:
-
vhost user library: for emulated PCI NIC control plane
-
DPDK libraries: for emulated PCI NIC data plane
Support for user space vhost has been provided with QEMU 2.1 and above.
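A guest is typically connected to a user space vhost backend by pointing QEMU at the backend's Unix socket and backing the guest RAM with shared hugepages. A minimal sketch, assuming a hypothetical socket path /var/run/vhost-user/vhu0 exposed by the user space switch:
$ qemu-system-x86_64 -enable-kvm -m 4096 \
    -object memory-backend-file,id=mem0,size=4096M,mem-path=/dev/hugepages,share=on \
    -numa node,memdev=mem0 \
    -chardev socket,id=chr0,path=/var/run/vhost-user/vhu0 \
    -netdev type=vhost-user,id=net0,chardev=chr0 \
    -device virtio-net-pci,netdev=net0 \
    ...
Sharing the guest memory (share=on) is mandatory so that the vhost-user backend can read and write the virtqueues directly.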
Run DPDK in a guest VM
Virtual IOMMU
Virtual IOMMU (vIOMMU) allows emulating an IOMMU for guest VMs.
vIOMMU has the following characteristics:
-
translates guest virtual machine I/O Virtual Addresses (IOVA) to guest Physical Addresses (GPA)
-
Guest virtual machine Physical Addresses (GPA) are translated to Host Virtual Addresses (HVA) through the hypervisor memory management system.
-
performs device isolation.
-
implements an I/O TLB (Translation Lookaside Buffer) API which exposes memory mappings
In order to get a virtual device working with a virtual IOMMU we have to:
-
create the needed IOVA mappings into the vIOMMU
-
configure the device’s DMA with the IOVA
Following mechanisms can be used to create vIOMMU memory mappings:
-
Linux Kernel’s DMA API for kernel drivers
-
VFIO for user space drivers
The integration between the virtual IOMMU and a user space network application like DPDK is usually done through the VFIO driver. This driver performs device isolation and automatically adds the memory (IOVA to GPA) mappings to the virtual IOMMU.
The use of hugepage memory in DPDK helps optimize TLB lookups, since fewer memory pages can cover the same amount of memory. Consequently, the number of device TLB synchronization messages drops dramatically, and the performance penalty of TLB lookups is lowered.
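For reference, a vIOMMU can be emulated by QEMU on a q35 machine; a hedged sketch (exact options depend on the QEMU version):
$ qemu-system-x86_64 -machine q35,accel=kvm,kernel-irqchip=split \
    -device intel-iommu,intremap=on,caching-mode=on,device-iotlb=on \
    ...
caching-mode=on is needed for VFIO-based device assignment inside the guest, and device-iotlb=on enables the device IOTLB interface used by vhost.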
Virtio Poll Mode Driver
The virtio-pmd driver is a DPDK driver, built on the Poll Mode Driver abstraction, that implements the virtio protocol.
The vhost-user protocol moves the virtio ring from the kernel all the way to user space. The ring is shared between the guest and the DPDK application. QEMU sets up this ring as a control plane using Unix sockets.
If both the host server and the guest virtual machine run DPDK, there are no VMEXITs in the host for guest packet processing. The guest virtual machine uses the virtio-net PMD driver and performs packet polling, so nothing runs in the kernel and there are no system calls. Since both system calls and VMEXITs are avoided, performance is boosted significantly, by up to an order of magnitude.
Physical Network Device Assignment (VFIO) and PCI passthrough
When a DPDK application runs in a guest virtual machine, a mechanism has to be used to expose one of the host physical NICs to this guest so that it gets access to the physical network.
The IOMMU protects host memory against malicious or buggy writes which could corrupt host memory at any time. But when a physical device is assigned to a guest virtual machine without a vIOMMU, the guest memory address space is totally exposed to the hardware PCI device.
A PCI device can be assigned to a guest in order to be used by a guest DPDK application. By leveraging the VFIO driver in the host kernel, we provide direct access to an assigned physical NIC from this guest, protected by the IOMMU.
Next, by leveraging the VFIO driver in the guest kernel, we provide direct access to the assigned physical NIC from the guest user space. The vIOMMU provides a secure mechanism to manage DMA transfers between the assigned physical hardware and the hosted guest virtual instance memory area.
SRIOV and DPDK in Guest VM
This use case is almost the same as PCI passthrough: VFIO and the IOMMU are used to expose an SR-IOV virtual function directly to a guest VM.
An additional, vendor-specific physical function driver is used to manage virtual function creation on the physical NIC. This driver is used by a Virtual Machine Manager (like libvirt) to create the virtual function before the virtual instance is spawned.
Incoming physical packets are directly copied into guest memory without involving the host server. SR-IOV only allows sharing a physical NIC between several guests; it does not change the packet processing path provided by PCI passthrough.
VirtIO assisted Hardware acceleration
With DPDK and VirtIO we have a technology that allows network virtualization at high speed. This is a key technology for the SDN dataplane.
But this packet processing model still has some drawbacks:
-
DPDK requires isolating some host CPUs for its exclusive use, leaving fewer CPU resources for user applications.
-
Compute CPUs are generic and not optimized for packet processing. DPDK requires a lot of CPU to provide a virtual network that is both feature-rich and performant (on the host compute for the DPDK vrouter/vswitch application, and on the guest VM for the DPDK end-user application).
SR-IOV brings performance, but its use is limited in SDN applications due to its direct path between the guest VM and the NIC hardware, which bypasses the host operating system in which the SDN network functions (vswitch and vrouter) are running.
In the coming sections, we describe some evolutions of both VirtIO and direct device assignment that aim to provide a solution that:
-
runs in user space, as proposed by DPDK
-
offers hardware performance, as provided by SR-IOV and direct physical device assignment
-
is feature-rich enough to be used in SDN, as provided by the VirtIO software solution.
Virtio full offloading
With virtio full hardware offloading, both the virtio data plane and virtio control plane are offloaded to the NIC hardware. The physical NIC must support:
-
the virtio control specification: discovery, feature negotiation, establishing/terminating the data plane.
-
the virtio dataplane specification: virtio ring layout.
Hence, once the guest memory is mapped to the NIC using virtio physical device passthrough, the guest communicates directly with the NIC via PCI without involving any specific drivers in the host kernel.
Guest VM packet processing is directly performed in the NIC hardware, but presented to the guest instance like a regular virtio emulated interface. The guest VM does not see any difference between a virtio emulated interface and an assigned physical virtio NIC, as they are exposed with the same virtio frontend driver in the guest.
virtio device passthrough
Virtio device passthrough can be implemented on a NIC whether or not it supports SR-IOV.
Like the other physical device assignment techniques presented in this book, VFIO and the IOMMU are used to present the physical NIC in the guest VM user space.
Hence, such a virtio physical NIC can be used by a DPDK application running in a virtual instance. But virtio device passthrough has the same limitations for SDN as the other assignment techniques: as the host operating system is totally bypassed by this mechanism, we cannot interconnect instances using such a NIC interface with an SDN virtual router or switch.
The main advantage of virtio device passthrough is the flexibility it provides for a virtual instance to transparently use either a real physical interface or an emulated one. It relies on an open public specification, which makes the device fully independent of any specific vendor.
Virtio full hardware offloading can support live migration thanks to virtio, which is not achievable with SR-IOV without a vendor-specific implementation.
But in order to support such a feature, the latest virtio specification (version 1.1) must be implemented by both QEMU and the NIC hardware used in the cloud infrastructure.
Virtio Datapath Acceleration
Like full hardware offloading, virtio Data Path Acceleration (vDPA) aims to:
-
standardize the physical data plane using the virtio ring layout
-
present a standard virtio driver in the guest decoupled from any vendor implementation for the control path
vDPA presents a generic control plane through a software component which provides an abstraction layer on top of the physical NIC.
Like virtio full hardware offloading, vDPA builds a direct data path between the guest network interface and the physical NIC, using the virtio ring layout. But for the control path, a generic vDPA driver (mediation driver) translates the vendor NIC control plane to the VirtIO control plane, allowing each NIC vendor to keep using its own driver.
It allows NIC vendors to support the virtio ring layout with a smaller effort while keeping wire-speed performance on the data plane.
virtio datapath acceleration
vDPA requires a vendor-specific "mediation device driver" to be loaded in the host operating system.
Smart NIC
The NIC card generation commonly named "smart NIC" is highly customizable thanks to new capabilities (FPGA, P4).
It makes it possible to move the SDN vSwitch/vRouter dataplane function into the NIC card, keeping only the control plane function in the host operating system.
For the Contrail solution, this is done by offloading several Contrail vRouter tables, including:
-
Interface Tables
-
Next Hop Tables
-
Ingress Label Manager (ILM) Tables
-
IPv4 FIB
-
IPv6 FIB
-
L2 Forwarding Tables
-
Flow Tables
This accelerates lookups and forwarding actions, which are directly performed in the NIC.
SDN packet processing is fully done in the NIC card; host CPUs are no longer involved in packet processing.
Two implementations are proposed by Netronome:
SRIOV + SmartNIC:
vDPA + Smart NIC:
eBPF and XDP
Berkeley Packet Filter (BPF) was designed for capturing and filtering network packets that match specific rules. In recent years, extended BPF (eBPF) has been designed to take advantage of modern hardware (64-bit registers for instance). An eBPF program is "attached" to a designated code path in the kernel.
eXpress Data Path (XDP) uses eBPF to achieve high-performance packet processing by running eBPF programs at the lowest level of the network stack, immediately after a packet is received.
XDP support is made available in the Linux Kernel since version 4.8, while eBPF is supported in the Linux Kernel since version 3.18.
XDP requires:
-
MultiQ NICs
-
Common protocol-generic offloads:
-
TX/RX checksum offload
-
Receive Side Scaling (RSS)
-
Transport Segmentation Offload (TSO)
The XDP packet processor:
-
performs RX packet processing in the kernel
-
processes RX packets directly (without any additional memory allocation for software queues, nor any socket buffer allocation)
-
assigns one CPU to each RX queue; this CPU can be configured in poll mode or interrupt mode
-
triggers the BPF program for packet processing
BPF programs:
-
parse packets
-
perform table lookup
-
manage stateful filters
-
manipulate packets (encapsulation, decapsulation, NAT, …)
The main BPF program actions are:
-
Forward
-
Forward after modification (NAT)
-
Drop
-
Normal receive (regular Linux packet processing with socket buffer and TCP/IP stack)
-
Generic Receive Offload (coalesce several received packets of the same connection)
XDP is also able to offload an eBPF program to a NIC card which supports it, reducing the CPU load.
XDP and eBPF do not require:
-
allocating large pages
-
allocating dedicated CPUs
-
choosing between a packet polling and an interrupt-driven networking model
-
user space to kernel space context switching to perform eBPF filtering
In addition, they allow packet processing to be offloaded when the NIC card in use supports it.
Note that eBPF programs are also supported in DPDK applications.
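To make the attachment point concrete, below is a minimal XDP program sketch in C, compiled with clang for the BPF target. It is purely illustrative and not taken from any product: it passes all traffic except IPv4 UDP frames, which it drops at the earliest point of the receive path. The program and section names (xdp_drop_udp, "xdp") are our own choices.

```c
/* Minimal XDP sketch. Build with: clang -O2 -g -target bpf -c xdp_drop_udp.c */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp")
int xdp_drop_udp(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;                    /* truncated frame: normal receive */
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    /* drop IPv4 UDP before any socket buffer is ever allocated */
    return ip->protocol == IPPROTO_UDP ? XDP_DROP : XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```

Assuming an iproute2-based workflow, such an object could be attached to an interface with a command along the lines of "ip link set dev eth0 xdp obj xdp_drop_udp.o sec xdp".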
NIC virtualization solutions summary
We’ve seen many NIC virtualization models for virtual instances, from fully software implementations, as proposed by virtio, to fully hardware-assisted solutions, as proposed by SR-IOV. In addition, DPDK provides the ability to move NIC packet processing from kernel space to user space.
In the diagram below, we provide an overview of the NIC virtualization solutions:
-
Fully software solutions are very flexible and fit well with SDN and cloud feature expectations (live migration, east-west traffic inside compute hosts).
-
Hardware-assisted solutions are very performant but offer less of the flexibility expected from virtualization. Guest VM migration is poorly supported due to hardware dependencies. These solutions fit well with applications requiring heavy north-south traffic (from the guest VM to outside the cloud).
In the middle, SmartNIC and DPDK offer the best compromise for SDN usage. Smart NICs provide very high performance, but they are not yet a fully mature solution (many vendor-specific implementations, no agreed standard).
(*): depends on hardware and QEMU latest virtio specification support on the NIC card.
Chapter 3: Contrail DPDK vRouter architecture
Contrail Software Stack
Contrail is an SDN platform which provides virtual networking, mainly for overlay workloads such as virtual machines and containers. It consists of two components:
-
Contrail controller
-
Contrail vRouter
Contrail Controller is a logically centralized but physically distributed SDN controller that is responsible for providing the management, control, and analytics functions for the whole cluster.
This picture shows the high-level description of the contrail architecture.
At the top, there is an orchestrator which can be Openstack or Kubernetes. Below that, there are controller components like control node, config node and analytics node. At the bottom right is the compute node. The compute nodes are general purpose x86 servers which will be the main focus of this chapter.
Contrail compute node
This picture shows a more detailed view of the compute node. This is the place where vRouter runs. It is the most important component of the contrail dataplane. We can see some workloads running. The workloads can be either virtual machines or containers. These workloads have their interfaces plumbed into the vRouter.
At a high level, vRouter forms dynamic overlay tunnels with other workloads running on the same or different computes to send and receive data traffic. Within the server, it switches the packets between the VM interfaces and physical interfaces after doing the required encapsulations or decapsulations. Currently, the encapsulations supported by vRouter are MPLS over UDP (MPLSoUDP), MPLS over GRE (MPLSoGRE) and VXLAN. Each of these workloads has a corresponding forwarding state or routing instance inside vRouter, which it uses to switch the packets. The physical interface connected to the top-of-rack switch can be in single or bonded mode.
The vRouter itself can run either as a Linux kernel module or as a userspace DPDK process. There is also a vRouter agent process running in user space. The agent has a connection to the controller over an XMPP channel, which is used to download configuration and forwarding information. The main job of the agent is to program this forwarding state into the vRouter forwarding plane.
vRouter architecture
vRouter is the workhorse of the Contrail system. Each and every packet to and from the contrail cluster goes through vRouter. vRouter is highly performant, efficient and has the capability to process millions of packets per second. It is multi-threaded, multi-cored and multi-queued to achieve maximum parallelism and exploit the x86 hardware to the maximum extent.
To support the rich and diverse features, vRouter has a sophisticated packet processing pipeline. The same pipeline can be stitched together by the vRouter agent process, from the simplest to the most complicated form, depending on the treatment which needs to be given to a packet. vRouter maintains multiple instances of forwarding bases, and all table accesses and updates use RCU (Read-Copy-Update), which is essentially a lockless mechanism.
vRouter and its interfaces
The picture below describes the vRouter and its interfaces to the outside world. It has interfaces to each of the workloads (VM1, VM2.. VMn) that it manages. These are typically tap interfaces.
To send packets to other physical servers or switches, it uses the physical interfaces. They can be single or bonded NIC. vRouter is only interested in overlay packets or the packets to/from the workloads. For other packets, it uses the linux interface to send them to the host operating system.
This Linux interface is called vhost0. vRouter also has interfaces toward the vRouter agent: a netlink interface to download the forwarding state, and an interface to send and receive exception packets; the latter is called the pkt0 interface.
Below is sample output from the “vif --list” command, which lists all the vifs configured on a compute node:
[root@a7s3 ~]# vif --list
Vrouter Interface Table

Flags: P=Policy, X=Cross Connect, S=Service Chain, Mr=Receive Mirror
       Mt=Transmit Mirror, Tc=Transmit Checksum Offload, L3=Layer 3, L2=Layer 2
       D=DHCP, Vp=Vhost Physical, Pr=Promiscuous, Vnt=Native Vlan Tagged
       Mnp=No MAC Proxy, Dpdk=DPDK PMD Interface, Rfl=Receive Filtering Offload,
       Mon=Interface is Monitored, Uuf=Unknown Unicast Flood,
       Vof=VLAN insert/strip offload, Df=Drop New Flows, L=MAC Learning Enabled,
       Proxy=MAC Requests Proxied Always, Er=Etree Root,
       Mn=Mirror without Vlan Tag, HbsL=HBS Left Intf, HbsR=HBS Right Intf,
       Ig=Igmp Trap Enabled

vif0/0      PCI: 0000:00:00.0 (Speed 20000, Duplex 1) NH: 4
            Type:Physical HWaddr:90:e2:ba:c3:af:20 IPaddr:0.0.0.0
            Vrf:0 Mcast Vrf:65535 Flags:TcL3L2VpVofEr QOS:-1 Ref:16
            RX device packets:14117825256  bytes:2456433542438 errors:0
            RX queue errors to lcore 0 0 0 0 0 0 0 0 0 0 0 0
            Fabric Interface: eth_bond_bond0  Status: UP  Driver: net_bonding
            Slave Interface(0): 0000:02:00.0  Status: UP  Driver: net_ixgbe
            Slave Interface(1): 0000:02:00.1  Status: UP  Driver: net_ixgbe
            Vlan Id: 101  VLAN fwd Interface: vfw
            RX packets:7058889673  bytes:1199976475061 errors:0
            TX packets:7059332226  bytes:1200700918913 errors:0
            Drops:392133
            TX device packets:14119406674  bytes:2457969960530 errors:0

vif0/1      PMD: vhost0 NH: 5
            Type:Host HWaddr:90:e2:ba:c3:af:20 IPaddr:8.0.0.4
            Vrf:0 Mcast Vrf:65535 Flags:L3DEr QOS:-1 Ref:13
            RX device packets:815137  bytes:780115621 errors:0
            RX queue errors to lcore 0 0 0 0 0 0 0 0 0 0 0 0
            RX packets:815137  bytes:780115621 errors:0
            TX packets:873131  bytes:162620313 errors:0
            Drops:12
            TX device packets:873131  bytes:162620313 errors:0

vif0/2      Socket: unix
            Type:Agent HWaddr:00:00:5e:00:01:00 IPaddr:0.0.0.0
            Vrf:65535 Mcast Vrf:65535 Flags:L3Er QOS:-1 Ref:3
            RX port packets:135922 errors:0
            RX queue errors to lcore 0 0 0 0 0 0 0 0 0 0 0 0
            RX packets:135922  bytes:11689292 errors:0
            TX packets:36432  bytes:3198966 errors:0
            Drops:0

vif0/3      PMD: tap41a9ab05-64 NH: 32
            Type:Virtual HWaddr:00:00:5e:00:01:00 IPaddr:192.168.1.104
            Vrf:3 Mcast Vrf:3 Flags:PL3L2DEr QOS:-1 Ref:12
            RX queue packets:7057651439 errors:7736
            RX queue errors to lcore 0 0 0 0 0 0 0 0 0 0 7736 0
            RX packets:7057833621  bytes:875156312738 errors:0
            TX packets:7057123054  bytes:875068202430 errors:0
            ISID: 0 Bmac: 02:41:a9:ab:05:64
            Drops:7947

vif0/4      PMD: tapd2d7bb67-c1 NH: 29
            Type:Virtual HWaddr:00:00:5e:00:01:00 IPaddr:192.168.0.104
            Vrf:2 Mcast Vrf:2 Flags:PL3L2DEr QOS:-1 Ref:12
            RX queue packets:782831 errors:0
            RX queue errors to lcore 0 0 0 0 0 0 0 0 0 0 0 0
            RX packets:799687  bytes:81599398 errors:0
            TX packets:1110661  bytes:85243244 errors:0
            ISID: 0 Bmac: 02:d2:d7:bb:67:c1
            Drops:1665
The different types of interfaces listed here are:
-
vif0/0 - Represents the underlay NIC card (usually a Linux bond interface).
-
vif0/1 – Represents the interface to the Linux operating system (vhost0).
-
vif0/2 – Represents the interface to the vRouter agent (pkt0).
-
vif0/3 and higher – Represent the virtual machine interfaces (VNICs).
vRouter packet processing Pipeline
The vRouter packet processing pipeline is described in the picture below.
There are various tables and engines in action in this pipeline. Some of the important tables in this pipeline are flow table, route table, NH table and the MPLS/VXLAN table. The vRouter agent programs these tables based on the forwarding state it receives from the control node and also based on its own internal processing. Each packet, depending on which interface it is coming from, is subjected to the desired processing.
At a high level, all packets enter from an interface called ‘vif’. The vifs are nothing but one of the vRouter interfaces that we described previously. Example: tap interface, physical interface, vhost0 interface, agent interface etc. Depending upon the configuration of that interface, it enters different pipeline stages, doing lookups in different tables and based on what actions are defined in each stage, the packets are modified accordingly.
At the end of the processing, the packet is sent to another vRouter interface, or vif, after encapsulation or decapsulation. This is a fairly generic pipeline, and the agent stitches it together based on the rich feature set that the contrail cluster is configured with.
Another important aspect of vRouter is its forwarding modes. The vRouter can work in two modes: flow mode (bottom pipeline in the diagram above) or packet mode (top pipeline in the diagram above). By default, Contrail works in flow mode. This means that vRouter keeps track of every single flow traversing it. Depending on the flow action, it can either forward the packet or drop it. In packet mode, vRouter bypasses the flow table and directly uses the nexthop to determine the treatment to be given to the packet. Example: if the nexthop is a tunnel nexthop, the packet is encapsulated in a tunnel header and forwarded onto an outgoing interface.
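As a rough illustration, the C sketch below contrasts the two modes. All names (flow_lookup, nexthop_lookup, apply_nexthop, drop) are hypothetical stand-ins, not the real vRouter API:

```c
/* Sketch of flow mode vs. packet mode dispatch, under the assumption that
 * the extern helpers stand in for vRouter internals. */
struct packet;  /* opaque packet handle */
struct nexthop; /* opaque nexthop handle */

enum flow_action { FLOW_FORWARD, FLOW_DROP };

extern enum flow_action flow_lookup(struct packet *pkt);     /* flow table */
extern struct nexthop  *nexthop_lookup(struct packet *pkt);  /* NH table   */
extern void apply_nexthop(struct nexthop *nh, struct packet *pkt);
extern void drop(struct packet *pkt);

void forward(struct packet *pkt, int flow_mode)
{
    if (flow_mode) {
        /* flow mode (the default): every flow is tracked, and the flow
         * action decides whether the packet is forwarded or dropped */
        if (flow_lookup(pkt) == FLOW_DROP) {
            drop(pkt);
            return;
        }
    }
    /* packet mode, or a forwarding flow action: the nexthop decides the
     * treatment, e.g. a tunnel nexthop encapsulates the packet and picks
     * the outgoing interface */
    apply_nexthop(nexthop_lookup(pkt), pkt);
}
```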
vRouter deployment methods
Contrail supports three kinds of vRouter deployments:
Linux Kernel
In this method of deployment, vRouter is installed as a kernel module (vrouter.ko) inside the Linux operating system. This is the default installation mode when configuring a compute node. vRouter registers itself with the linux TCP/IP stack to get packets from any of the linux interfaces. It uses the netdev_rx_handler_register() API provided by linux for this purpose. The interfaces can be bond, physical, tap (for VMs), veth (for containers) etc. It relies on linux to send and receive packets from different interfaces. Example: Linux exposes a tap interface backed by vhost-net driver to communicate with VMs. Once vRouter registers for packets from this tap interface, the linux stack sends all the packets to it. To send a packet, vRouter just has to use regular linux APIs like dev_queue_xmit() to send the packets out on a linux interface.
NIC queues (either physical or virtual) are handled by the Linux operating system.
With respect to packet processing performance, tuning has to be done at the Linux operating system level.
Here, packet processing works in interrupt mode. Interrupts result in a lot of context switches. When the packet rate is low, this works well. But as soon as the packet rate starts increasing, the system gets overwhelmed by the number of interrupts generated, resulting in poor performance.
DPDK
In this mode, vRouter runs as a user space application linked with the DPDK library. This is the performant version of vRouter, commonly used by telcos, where the VNFs themselves are DPDK-based applications. The performance of vRouter in this mode is more than 10 times higher than in kernel mode. The physical interface is used by DPDK’s poll mode drivers (PMDs) instead of the Linux kernel’s interrupt-based drivers. A user-IO (UIO) kernel module, like vfio or uio, is used to expose the network interface registers to user space so that they are reachable by the DPDK PMD. When a NIC is bound to a UIO driver, it is moved from Linux kernel space to user space, and is therefore no longer managed by, nor visible to, the Linux operating system. Consequently, it is the DPDK application (the vRouter here) that fully manages the NIC: packet polling, packet processing and packet forwarding. No action is taken by the operating system anymore; all user packet processing steps are performed by the vRouter DPDK dataplane.
The nature of this “polling mode” makes the vRouter DPDK dataplane packet processing and forwarding much more efficient than the interrupt mode used by the Linux kernel when the packet rate is high: there are no interrupts and no context switches during packet IO.
Note: When the network packet rate is low, this way of working can be less efficient than the regular kernel mode. In DPDK mode, a set of CPUs is fully dedicated to packet processing and keeps polling even in the absence of packets; if the packet rate is too low, many CPU cycles are unused and wasted. However, there is a built-in optimization which kicks in and yields the CPU for a short amount of time when no packets were seen in the previous polling interval.
Finally, since the DPDK vRouter does not rely on the Linux kernel for packet processing, it needs to be heavily tuned in its own right to get the best packet processing performance.
In this chapter we’ll mainly focus on the architecture of DPDK vRouter.
SmartNIC
In this mode, the Contrail vRouter runs inside the SmartNIC itself. This means host resources are not involved in packet processing, saving the CPU resources that vRouter would otherwise use. Since all the packet processing is done by the NIC hardware, the performance is the best of the three deployment types.
Currently, contrail offers solutions with smart NICs from Netronome and Mellanox. At the time of writing of this book, a solution based on the Intel PAC N3000 smart NIC was being worked on.
DPDK vrouter architecture
DPDK vRouter software architecture
DPDK vRouter is a userspace application, as mentioned previously. It comprises multiple pthreads, which are also called lcores (logical cores) in DPDK terminology. Each pthread has a specific role to perform. The lcores run in a tight loop, also called poll mode. They can exchange packets among themselves using DPDK queues. Each lcore has a receive queue which other lcores can use to enqueue packets that need to be processed by that lcore. They also poll the queues of the different vRouter interfaces: physical, VM and tap.
DPDK vRouter and lcores
vRouter is a multi-threaded user-space application. It spawns several pthreads, or lcores, which run in a tight ‘while’ loop. Each lcore is responsible for a specific task. The different types of lcores are:
-
Forwarding lcores
-
Service lcores
-
Tapdev lcore
-
Timer lcore
-
Uvhost lcore
-
Packet (Pkt0) lcore
-
Netlink lcore
Forwarding lcores
Forwarding lcores are responsible for polling the physical and virtual interfaces. A physical interface can also be a bonded interface. In addition, they perform the vRouter packet processing briefly illustrated in the section “vRouter packet processing Pipeline”. These lcores can assume both the polling and the processing roles.
These lcores are spawned by the vRouter with a well-defined CPU list. The vRouter gets the CPU list as a “core mask” through the “taskset” Linux command.
Example: taskset 0x1e0 /usr/bin/contrail-vrouter-dpdk --no-daemon
The binary representation of 0x1e0 is as follows:

| CPU number | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|---|---|---|---|---|---|---|---|---|---|
| Bit value | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
This will make the vRouter spawn 4 forwarding lcores, pinned to CPUs 5, 6, 7 and 8 (the set bits of the mask).
The first forwarding lcore is named lcore10, the next one lcore11, and so on. Hence, if a DPDK vRouter has been configured with 4 polling and processing CPUs in its CPU list, 4 threads will be launched: lcore10, lcore11, lcore12 and lcore13.
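For reference, a tiny C program can decode such a core mask into the pinned CPU list (assuming, as taskset does, that bit N set means CPU N is usable):

```c
#include <stdio.h>

int main(void)
{
    unsigned long mask = 0x1e0;              /* the example mask used above */
    for (unsigned cpu = 0; cpu < 8 * sizeof(mask); cpu++)
        if (mask & (1UL << cpu))
            printf("CPU %u\n", cpu);         /* prints CPUs 5, 6, 7 and 8 */
    return 0;
}
```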
Below is output listing the threads running in vRouter, together with their names and PIDs:
[root@a7s4 ~]# ps -T -p $(pidof contrail-vrouter-dpdk)
  PID  SPID TTY          TIME CMD
 3685  3685 ?        03:47:37 contrail-vroute   <- main thread and tuntap lcore
 3685  3800 ?        00:04:32 eal-intr-thread   <- DPDK control thread
 3685  3801 ?        00:00:00 rte_mp_handle     <- DPDK control thread
 3685  3802 ?        04:55:48 lcore-slave-1     <- timer lcore
 3685  3803 ?        00:00:02 lcore-slave-2     <- uvhost lcore
 3685  3804 ?        00:00:11 lcore-slave-8     <- packet (pkt0) lcore
 3685  3805 ?        00:04:12 lcore-slave-9     <- netlink lcore
 3685  3806 ?      6-16:39:37 lcore-slave-10    <- forwarding thread #1
 3685  3807 ?      6-16:40:48 lcore-slave-11    <- forwarding thread #2
 3685  3808 ?      6-16:35:35 lcore-slave-12    <- forwarding thread #3
 3685  3809 ?      6-16:37:52 lcore-slave-13    <- forwarding thread #4
 3685  5048 ?        00:00:00 lcore-slave-9     <- fork of netlink lcore (for client-mode qemu)
Packet processing models in dataplane software
There are three packet processing models that a multi-threaded dataplane application can follow:
-
Run-to-completion model
-
Pipeline model
-
Hybrid model
In the run-to-completion model, the software does not have multiple stages; it does the entire processing in a single context or single stage. There are no FIFOs here, so latency overheads are lower.
In the pipeline model, the software is divided into multiple stages. Each stage completes part of the processing and hands it over to the next stage, and so on. The handover is done through a FIFO buffer between the stages. These buffers introduce latency, but the main advantage of this model is that it ensures even load balancing across stages when some stages are more loaded than others.
Contrail vRouter uses a hybrid model: the pipeline model in some scenarios and the run-to-completion model in others. This ensures good load balancing of all lcores with reasonable latency. It needs FIFOs because of the pipelining.
The different ways packet processing is done by the vRouter are as follows:
-
Run-to-completion: A forwarding lcore polls packets from a vif Rx queue. It then performs the vRouter packet processing and determines the encap/decap that needs to be done. It also finds which outgoing vifs the modified packets need to be sent to. Finally, it sends them on those outgoing vif Tx queues.
-
Pipeline: A forwarding lcore polls packets from a vif Rx queue. It then distributes these packets to other forwarding lcores using the DPDK software rings between them. The other forwarding lcores pick up the packets and perform the packet processing. They then send the modified packets to the outgoing vif Tx queues. (A sketch of both models follows the scenario lists below.)
vRouter uses Run-to-completion model in one or more of these scenarios:
-
The option “--vr_no_load_balance” is configured
-
The packets coming on the NIC from the wire have MPLSoUDP encapsulation
-
The packets coming on the NIC from the wire have VXLAN encapsulation
vRouter uses Pipeline mode in one or more of these scenarios:
-
The packets coming on the NIC from the wire have MPLSoGRE encapsulation
-
The packets are received by the vRouter from the Workloads (VMs or containers)
-
The option “--vr_no_load_balance” is turned off
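As a rough illustration of the two models, here is a C sketch built on DPDK's rte_ring API. Only the rte_ring calls are real DPDK API; poll_vif_rx, process_and_route and send_on_vif_tx are hypothetical stand-ins for vRouter internals:

```c
#include <rte_ring.h>
#include <rte_mbuf.h>

#define BURST 32

/* hypothetical stand-ins for vRouter internals (declared, not defined) */
extern unsigned poll_vif_rx(struct rte_mbuf **pkts, unsigned n);
extern struct rte_mbuf *process_and_route(struct rte_mbuf *pkt);
extern void send_on_vif_tx(struct rte_mbuf *pkt);

/* run-to-completion: one lcore polls, processes and transmits */
void run_to_completion_iter(void)
{
    struct rte_mbuf *pkts[BURST];
    unsigned n = poll_vif_rx(pkts, BURST);
    for (unsigned i = 0; i < n; i++)
        send_on_vif_tx(process_and_route(pkts[i]));
}

/* pipeline, stage 1: the polling lcore only polls and hands packets over */
void pipeline_poll_iter(struct rte_ring *to_processing)
{
    struct rte_mbuf *pkts[BURST];
    unsigned n = poll_vif_rx(pkts, BURST);
    rte_ring_enqueue_burst(to_processing, (void **)pkts, n, NULL);
}

/* pipeline, stage 2: another forwarding lcore processes and transmits */
void pipeline_process_iter(struct rte_ring *from_polling)
{
    struct rte_mbuf *pkts[BURST];
    unsigned n = rte_ring_dequeue_burst(from_polling,
                                        (void **)pkts, BURST, NULL);
    for (unsigned i = 0; i < n; i++)
        send_on_vif_tx(process_and_route(pkts[i]));
}
```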
Service lcores
Service lcores are responsible for handling all vRouter interfaces other than workload (VM) interfaces and physical interfaces. They also handle other book-keeping and miscellaneous tasks for vRouter, like timer management and the vhost-user control path. By default, they are not pinned to any physical CPU.
Most of the service lcores make use of user sockets to talk to other processes in the system, like the agent, qemu (VM) and the Linux stack.
User sockets in vRouter
A user socket (usocket) is an object on which IO happens. While it can represent non-socket objects too (like an eventfd), most consumers are socket users, hence a usocket is primarily a socket.
A socket, when used for IO, has to have a protocol to understand the format of the data that enters and exits it. vRouter DPDK has three protocols –
-
NETLINK
-
PACKET
-
EVENT
A NETLINK socket carries Netlink messages, i.e. each message in the socket has a netlink header.
A PACKET socket carries packets that have an agent_hdr. A PACKET socket has a ring, a vif, and a child usocket representing an eventfd that is written by the datapath threads to wake up the packet thread whenever new packets are enqueued on the ring.
The EVENT protocol represents an eventfd. We can write an 8-byte value that is accumulated across writes until it is read by the reader. This is used as a wakeup mechanism for one or more threads.
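This accumulate-on-write behaviour is standard Linux eventfd semantics, which the following self-contained C sketch demonstrates (plain kernel API, not vRouter code):

```c
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/eventfd.h>

int main(void)
{
    int efd = eventfd(0, 0);       /* internal 64-bit counter starts at 0 */
    uint64_t v;

    v = 1;
    if (write(efd, &v, sizeof(v)) != sizeof(v)) return 1;  /* first signal  */
    v = 2;
    if (write(efd, &v, sizeof(v)) != sizeof(v)) return 1;  /* second signal */

    /* a single read returns the accumulated value and resets the counter */
    if (read(efd, &v, sizeof(v)) != sizeof(v)) return 1;
    printf("accumulated: %llu\n", (unsigned long long)v);  /* prints 3 */

    close(efd);
    return 0;
}
```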
For each protocol, multiple transport types could make sense. For example, for a NETLINK socket, both a TCP and a UNIX transport could make sense. However, for a PACKET socket, only a RAW transport makes sense.
Tapdev lcore
vRouter implements a custom PMD for tuntap devices, which can be used to send and receive packets between vRouter and the Linux host OS. It is a replacement for the DPDK KNI PMD. Currently, the “vhost0” and “monitoring” interfaces (used by the vifdump utility, explained later) make use of it.
When a tap device is initialized, vRouter uses the “tun” driver (/dev/net/tun) in linux and creates a tuntap device.
[root@a7s3 ~]# ethtool -i vhost0
driver: tun
version: 1.6
firmware-version:
expansion-rom-version:
bus-info: tap
supports-statistics: no
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no
When the netlink communication channel between the agent and vRouter DPDK has been set up using the netlink lcore, the agent sends a message to vRouter DPDK to add the vhost0 interface. As part of this sequence, a new vhost0 vif (vif0/1) is created and set up so that the tapdev lcore is responsible for polling the vhost0 interface. In each iteration, the PMD uses raw “read” and “write” socket calls to receive and transmit packets to the tuntap device.
Receiving packets from vhost0
One of the forwarding cores is assigned to process the “vhost0” packets and polls a dedicated DPDK ring called the “tapdev_rx_ring”. This ring is added to the forwarding lcore’s poll list when the vhost vif is added by the vRouter agent. The tapdev PMD receives packets from the vhost0 interface using the “read()” socket call and enqueues them to the above-mentioned DPDK ring. The designated forwarding core then picks up these packets and processes them.
Sending packets to vhost0
All the forwarding cores have Tx rings for vhost0. Packets that need to be sent to vhost0 are enqueued to these Tx rings by the lcores. The tapdev PMD polls these Tx rings and dequeues the packets. It then sends the packets to the “vhost0” interface using the write socket call.
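Putting the two directions together, here is a C sketch of one tapdev polling iteration. The read()/write() and rte_ring calls are real APIs; alloc_pkt_buffer, free_pkt_buffer and pkt_len are hypothetical stand-ins for the buffer management:

```c
#include <unistd.h>
#include <rte_ring.h>

#define MTU 1500

/* hypothetical stand-ins for the PMD's buffer management */
extern char *alloc_pkt_buffer(unsigned len);
extern void  free_pkt_buffer(void *buf);
extern unsigned pkt_len(void *buf);

void tapdev_poll_once(int tun_fd, struct rte_ring *tapdev_rx_ring,
                      struct rte_ring *vhost0_tx_ring)
{
    /* Rx: a packet read from vhost0 is enqueued for a forwarding lcore */
    char *buf = alloc_pkt_buffer(MTU);
    ssize_t n = read(tun_fd, buf, MTU);      /* raw read on the tun fd */
    if (n > 0)
        rte_ring_enqueue(tapdev_rx_ring, buf);
    else
        free_pkt_buffer(buf);

    /* Tx: packets queued by the forwarding lcores are written to vhost0 */
    void *pkt;
    while (rte_ring_dequeue(vhost0_tx_ring, &pkt) == 0) {
        write(tun_fd, pkt, pkt_len(pkt));    /* raw write to the tun device */
        free_pkt_buffer(pkt);
    }
}
```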
Timer lcore
Netlink lcore
Netlink lcore is responsible for establishing a communication channel with the agent for programming the forwarding state (like routes, nexthops, labels etc.). It creates a unix server socket at “/var/run/vrouter/dpdk_netlink” to which the agent connects.
(vrouter-agent-dpdk)[root@a7s4-kiran /]$ netstat -anp | grep dpdk_netlink
unix  2 [ ACC ] STREAM LISTENING  46105   3728/contrail-vrout  /var/run/vrouter/dpdk_netlink
unix  3 [     ] STREAM CONNECTED  4952631 3728/contrail-vrout  /var/run/vrouter/dpdk_netlink
(vrouter-agent-dpdk)[root@a7s4-kiran /]$ ps -eaf | grep 3728
root  3728  2551 99 Oct02 ? 210-14:44:48 /usr/bin/contrail-vrouter-dpdk --no-daemon --socket-mem 1024 --vlan_tci 101 --vdev eth_bond_bond0,mode=4,xmit_policy=l34,socket_id=0,mac=00:1b:21:bb:f9:48,lacp_rate=0,slave=0000:02:00.0,slave=0000:02:00.1
The first line of the output shows the state as “LISTENING” for DPDK vRouter which indicates that it is a server and is waiting for clients such as agent to connect to it. The second line shows the agent connected to it and so the state is “CONNECTED”.
The protocol carried on this socket is “NETLINK”, which means all messages have a 24-byte netlink header followed by the payload. The socket type is “UNIX”. The netlink header comprises the following:
-
Netlink message header
-
Generic netlink message header
-
Netlink attribute
The headers can easily be viewed by attaching gdb to the DPDK vRouter:
(gdb) ptype struct nlmsghdr Netlink message header
type = struct nlmsghdr {
unsigned int nlmsg_len;
unsigned short nlmsg_type;
unsigned short nlmsg_flags;
unsigned int nlmsg_seq;
unsigned int nlmsg_pid;
}
(gdb) ptype struct genlmsghdr Generic netlink message header
type = struct genlmsghdr {
__u8 cmd;
__u8 version;
__u16 reserved;
}
(gdb) ptype struct nlattr Netlink attribute
type = struct nlattr {
__u16 nla_len;
__u16 nla_type;
}
(gdb) p sizeof(struct nlmsghdr) + sizeof(struct genlmsghdr) + sizeof(struct nlattr)
$1 = 24
The payload of this message is in the “Sandesh” format, a proprietary data format (similar to XML) used by the agent and vRouter. Its layout is:

| Object name | Type | Type | ….. | Type |
|---|---|---|---|---|
The object name specifies the type of object the message contains - like nexthop, route, mpls etc.
Type can be a fixed-length datatype like uint8, uint16 or uint32. It can also be a variable-length datatype like “list”, in which case there will be a “length” field specifying the length of the list.
These messages are parsed by an inbuilt parser, and the appropriate callbacks are called depending on the object. Example: for a nexthop object, the nexthop callback within vRouter is called, which in turn programs that nexthop into the nexthop table.
If the vRouter needs to return a status or error message to the agent after processing the Sandesh object, it can do so. That way, the agent knows whether the programming was successful.
Packets sent between the vRouter agent and the Contrail control nodes are forwarded using the same packet processing principles as user packets.
The vRouter DPDK dataplane polling and processing cores forward XMPP packets between vif0/1 (the vhost0 interface on which the vRouter agent listens) and vif0/0 (connected to the underlay infrastructure to which the Contrail control nodes are attached).
Contrail DPDK vRouter packets processing
Packets polling and processing
During initialization of the NIC interface, vRouter configures it with the same number of queues as it has forwarding cores. For example, if the vRouter has 5 forwarding cores, it configures 5 Rx queues on the NIC.
A vif queue is made up of two DPDK rings:
-
one Rx ring, in which packets received from a NIC are stored, waiting to be processed by the vRouter
-
one Tx ring, in which packets to be sent by the vRouter to a NIC are stored
Packets stored in vif Rx rings are polled by a forwarding lcore. There is a one-to-one mapping between forwarding cores and the NIC’s Rx queues. The polled packets are then processed by the same lcore or a different one, and then pushed to a target vif’s Tx ring.
Each lcore10 and higher started by a DPDK vRouter is a polling and processing thread. Each runs on a single CPU taken from the list defined by the CPU_LIST variable.
MPLS over GRE overlay
Incoming overlay-encapsulated packets are received on the compute’s physical network interface card, usually a bond made up of 2 NICs, used for user packet transport.
Incoming overlay packets are placed into the physical NIC queues using the RSS (Receive Side Scaling) hashing algorithm. At vRouter startup, as many DPDK queues (both Rx and Tx rings) are created (with the help of the physical NIC PMD) as there are polling and processing cores allocated to the vRouter.
The RSS hashing algorithm for MPLSoGRE only uses a 3-tuple: source IP, destination IP and protocol number. Unfortunately, the entropy of these three values is low when GRE is used.
Indeed, the 3-tuple stays the same between two given compute nodes.
All packets coming from different virtual machines located on the same compute node are bound to the same 3-tuple. Hence, the hashing algorithm provides an identical value for all network flows coming from a single compute.
Consequently, all packets coming from virtual machines located on the same compute are received on only one DPDK Rx ring of the vif0/0 interface (the vRouter interface connected to the underlay network).
So, incoming MPLSoGRE overlay packets are not well balanced across the different polling and processing threads (lcores) the vRouter is fitted with. Therefore, when MPLSoGRE overlay is used, it was chosen to perform the packet processing (packet transformation and delivery into a vif Tx ring) on a different lcore than the one used for packet polling (retrieving a packet from a vif0/0 Rx ring).
A DPDK pipeline model is then used: a first lcore only performs packet polling, while a second one performs the packet processing. Internal queues are set up to store packets that have been polled by the polling lcore and are waiting to be processed by a processing lcore.
A hash algorithm is applied to the decapsulated (inner) packet in order to select one of the internal queues, each of which is handled by a single processing lcore.
Thanks to this mechanism, all vRouter-allocated CPUs are used, even when only a few compute nodes exist in the physical infrastructure and user packets are carried with the MPLS over GRE overlay protocol.
UDP overlay (VxLAN or MPLS over UDP)
When a UDP overlay protocol is used (MPLS over UDP or VxLAN), we have better entropy: a 5-tuple made of source IP, destination IP, source port, destination port and protocol. Indeed, even if few computes are used, the sending compute can create diversity by using distinct values in the UDP source port of the overlay packet.
Different network flows coming from the same remote virtual machine will generate different RSS hash results.
Consequently, incoming overlay packets are balanced across all the DPDK Rx rings configured for the physical interface, and there is no need to split the polling and processing steps. Therefore, when a UDP overlay protocol is used to transport user packets between compute nodes, the vRouter uses the same lcore for both the polling and processing steps of each packet.
It is more efficient to use UDP overlay protocols: with the same DPDK vRouter configuration, performance is higher when a UDP overlay protocol is chosen instead of MPLS over GRE.
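The entropy difference can be illustrated with a toy hash standing in for the NIC's real RSS (Toeplitz) function; the addresses, ports and queue count below are arbitrary:

```c
#include <stdio.h>
#include <stdint.h>

static uint32_t toy_hash(uint32_t h, uint32_t v) { return h * 31u + v; }

/* MPLSoGRE: only a 3-tuple is hashed, identical for every flow
 * between two given compute nodes */
static unsigned queue_for_gre(uint32_t sip, uint32_t dip, uint8_t proto,
                              unsigned nq)
{
    return toy_hash(toy_hash(toy_hash(0, sip), dip), proto) % nq;
}

/* MPLSoUDP / VxLAN: a 5-tuple, where the sender varies the UDP source
 * port to create entropy */
static unsigned queue_for_udp(uint32_t sip, uint32_t dip, uint16_t sport,
                              uint16_t dport, uint8_t proto, unsigned nq)
{
    uint32_t h = toy_hash(toy_hash(toy_hash(0, sip), dip), proto);
    h = toy_hash(toy_hash(h, sport), dport);
    return h % nq;
}

int main(void)
{
    unsigned nq = 4;                 /* e.g. 4 polling and processing lcores */
    for (uint16_t sport = 50000; sport < 50004; sport++)
        printf("GRE -> queue %u, UDP(sport=%u) -> queue %u\n",
               queue_for_gre(0x08000001, 0x08000002, 47, nq),
               (unsigned)sport,
               queue_for_udp(0x08000001, 0x08000002, sport, 6635, 17, nq));
    return 0;
}
```

Every GRE flow between the two computes lands on the same queue, while each distinct UDP source port can select a different one.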
Single Queue versus Multi-Queue NIC
A NIC connected to vRouter can be configured with several queues.
Each NIC queue is automatically pinned to a single vRouter polling and processing thread (lcore10 and higher). Consequently, when a NIC is configured with only a single queue, all incoming and outgoing packets are processed by a single vRouter polling and processing thread.
In order to avoid binding all single-queue interfaces to the same polling and processing thread, each interface queue is pinned to a distinct vRouter lcore in a round-robin manner at interface creation: single-queue vif0/3 is automatically pinned to lcore10, single-queue vif0/4 to lcore11, and so on.
Hence, the vRouter’s whole CPU power is automatically distributed among all the single-queue interfaces. This distribution is defined for each interface at creation time and is kept unchanged for the whole lifetime of the interface.
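A small C sketch of this round-robin pinning, under the assumption (per the text) that single-queue vifs are assigned starting at lcore10:

```c
#include <stdio.h>

#define FIRST_FWD_LCORE 10   /* forwarding lcores are lcore10 and higher */

int main(void)
{
    unsigned n_fwd_lcores = 4;                /* lcore10 .. lcore13 */
    for (unsigned vif = 3; vif <= 8; vif++)   /* vif0/3+ are VM interfaces */
        printf("single-queue vif0/%u -> lcore%u\n",
               vif, FIRST_FWD_LCORE + (vif - 3) % n_fwd_lcores);
    return 0;
}
```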
When a NIC is configured with several queues, each queue is bound to a distinct polling and processing thread (lcore). Hence, the vRouter’s whole CPU power is automatically distributed among all the queues of each multi-queue interface.
Even though there is no hard rule preventing a user from configuring a different number of queues on a NIC than the number of lcores (polling and processing threads) configured on the vRouter, the best scenario is to configure each multi-queue NIC with the same number of queues as the number of polling and processing threads configured on the vRouter.
Note: We also have to take into consideration that, currently, the DPDK vRouter is unable to correctly process a multi-queue vNIC configured with more queues than the number of polling and processing threads configured on the vRouter.
Supported scenarios
The Contrail DPDK vRouter is able to connect DPDK virtual machines as well as virtual machines using Linux kernel packet processing. Likewise, a Contrail kernel vRouter is also able to connect both DPDK and non-DPDK virtual machines.
But only two scenarios really make sense:
-
Kernel mode vRouter connecting kernel mode virtual machines
-
DPDK vRouter connecting DPDK virtual machines
In the kernel scenario, both the virtual machines and the Contrail vRouter work with a regular Linux TCP/IP stack using interrupt-mode packet processing. They share the same limitation (packet processing does not scale, due to interrupt mode) and the same advantage (no need to reserve lots of CPUs for packet processing). So this scenario is best used when the virtual machines do not expect high network connectivity performance.
In the DPDK scenario, both the virtual machines and the Contrail vRouter work with the DPDK library using poll-mode packet processing. They share the same limitation (poll mode requires reserving some CPUs for packet processing) and the same advantage (it allows reaching line-rate packet processing). This scenario is best used when the virtual machines require high network connectivity performance: typically, Virtual Network Functions (VNFs).
Hybrid cases are unsuitable. When a kernel-mode virtual machine is plugged into a Contrail DPDK vRouter, it badly impacts the performance of the whole Contrail vRouter and its VNFs. Indeed, the Contrail DPDK vRouter has to emulate interrupt mode using KVM features in order to kick the virtual machine. This involves a "VM exit", which is like a system call into the hypervisor and costs many CPU cycles. It impacts not only the kernel-mode VM but all the other DPDK VMs as well.
A DPDK virtual machine plugged into a Contrail kernel-mode vRouter is also very inefficient. Even if the virtual machine is able to process its network packets at very high speed, the Linux kernel packet processing used by the kernel-mode vRouter does not scale well. So, in the end, many packets generated by a high-speed VNF plugged into a Contrail kernel-mode vRouter may be lost.
This is why Contrail users have to be consistent and plug DPDK virtual machines into a DPDK dataplane vRouter and kernel-mode virtual machines into a kernel-mode dataplane vRouter.
When the virtual infrastructure is made up of several kinds of virtual machines (both DPDK and non-DPDK ones), a placement strategy has to be defined in order to spawn DPDK VMs on computes fitted with the Contrail DPDK vRouter and non-DPDK VMs on computes fitted with the Contrail kernel-mode vRouter.
chapter 4: contrail networking and test tools installation
In previous chapters, we have gone through the most important topics about SDN and DPDK in general, the DPDK vRouter architecture, vRouter packet processing details and so on. When you read these topics, you may wonder how to get a running contrail networking environment with a few DPDK vRouters in it, so you can play around, test those theories and familiarize yourself with what you’ve learned. Indeed, those topics are important; unfortunately, they are by themselves not so straightforward, so even though we’ve put great effort into illustrating them, some may still sound confusing, especially when you get down to the implementation details.
In this chapter, we will mostly focus on hands-on work and lab testing to verify some of the most important DPDK vRouter concepts and working mechanisms.
-
We’ll start by introducing the steps we’ve used to install the latest version of a contrail networking cluster.
-
On top of it, we start to build a testing environment. That includes a few VMs running the OPNFV PROX software. On each VM, based on its role, the PROX software is configured as either a traffic generator or a traffic receiver.
-
We’ll go on to introduce some of the commonly used DPDK tools, scripts and log entries that provide useful information to help us understand how things run in a DPDK environment.
-
In the end, we’ll go over some case studies. We use the PROX and rapid tools we’ve installed to generate different traffic patterns in our setup, and then use DPDK tools to analyze what we see.
After reading this chapter, you will have a deeper and more concrete understanding of some of the main concepts we’ve covered in this book. We’ll start with the contrail installation.
contrail installation
contrail installation methods
In this book, we’ve been focusing on the DPDK vRouter that runs in each individual compute node, basically in a relatively standalone mode. But if you look at the forwarding plane as a whole, the vRouters actually form a distributed system. In fact, as we briefed in chapter 1, the whole TF cluster is a complex distributed system involving many more software modules, especially in the control plane. Again, each of these software modules can be a completely different distributed system by itself; the cassandra database that the TF cluster uses is one such example. Explaining and understanding how things work in a distributed system is never easy, and neither is the installation process. It won’t be a surprise if you run into some installation issues in your lab. Generally speaking, it is always much more efficient to follow a detailed, verified process with step-by-step instructions to "avoid" the issues than to start in a "try-and-see" mode and then try to fix them.
Currently, the TF cluster has been integrated with many major deployment systems and platforms. Therefore, depending on your environment, there can be totally different ways of installing a contrail system. Here is an (incomplete) list of currently supported installation methods:
-
Installing Contrail with OpenStack and Kolla Ansible
-
Installing Contrail with RHOSP
-
Installing Kubernetes Contrail Cluster using the Contrail Command UI
-
Installing and Provisioning Contrail VMware vRealize Orchestrator Plugin
-
Installing a Standalone Red Hat OpenShift Container Platform 3.11 Cluster with Contrail Using Contrail OpenShift Deployer
-
Installing a Nested Red Hat OpenShift Container Platform 3.11 Cluster Using Contrail Ansible Deployer
-
Installing Contrail with OpenStack or kubernetes by Using Juju Charms
For example, with the second method you can install contrail with Red Hat OpenStack
Platform director 13 (RHOSPd), which is a toolset based on the OpenStack
project TripleO (OOO, OpenStack on OpenStack). A TF environment built with
RHOSPd uses the concepts of undercloud and overcloud. Basically, the undercloud is
a single server containing complete OpenStack components, whose role is just to
deploy and manage an overcloud, which is the tenant-facing environment that
hosts the "resulting" openstack and TF nodes. This deployment is currently used
in production by quite a few major service providers.
However, the installation process for such a deployment involves an understanding of RHOSPd, TripleO and many different types of network isolation topologies, which adds a lot of unnecessary complexity to our lab setup. In this section, we’ll give detailed steps for the first method: installing contrail with openstack and kolla ansible.
Kolla is an OpenStack project which provides tools to build container images for OpenStack services. Kolla Ansible provides Ansible playbooks to deploy the Kolla images. The contrail-kolla-ansible playbook works in conjunction with contrail-ansible-deployer to install OpenStack and Contrail Networking containers.
cluster diagram
re-image servers
TODO: some text here
configure bond and vlan
To enable a bond interface in centos, add these configuration files under
/etc/sysconfig/network-scripts/ on all nodes where a bond interface is needed:
(Table: ifcfg configuration files for the bond interface and its member interfaces.)
Then restart the network service to apply these configurations:
service network restart
Once the restart is successful, you should see the bond0 interface appearing on all nodes, with one of these IP addresses on each node: 8.0.0.1~4. Now we should have IP connectivity in both the management network and the fabric network.
Next, we’ll need to install ansible and use it to automate the rest of the
installation. Most of ansible’s magic is done through its
playbooks, and the configuration for all plays is done in a single file with the
default name instances.yaml. This configuration file has multiple main
sections. We’ll go over some of the main parameters in this file and then
introduce the steps to run the playbooks.
the configuration file instances.yaml
1 global_configuration:
2 CONTAINER_REGISTRY: svl-artifactory.juniper.net/contrail-nightly
3 REGISTRY_PRIVATE_INSECURE: True
4 provider_config:
5 bms:
6 ssh_pwd: c0ntrail123
7 ssh_user: root
8 ntpserver: 10.84.5.100
9 domainsuffix: englab.juniper.net
10 instances:
11 a7s2:
12 provider: bms
13 ip: 10.84.27.2
14 roles:
15 openstack_control:
16 openstack_network:
17 openstack_storage:
18 openstack_monitoring:
19 config_database:
20 config:
21 control:
22 analytics_database:
23 analytics:
24 webui:
25 a7s3:
26 provider: bms
27 ip: 10.84.27.3
28 ssh_user: root
29 ssh_pwd: c0ntrail123
30 roles:
31 openstack_compute:
32 vrouter:
33 PHYSICAL_INTERFACE: bond0.101
34 CPU_CORE_MASK: "0x1fe"
35 DPDK_UIO_DRIVER: uio_pci_generic
36 HUGE_PAGES: 32000
37 AGENT_MODE: dpdk
38 a7s4:
39 provider: bms
40 ip: 10.84.27.4
41 ssh_user: root
42 ssh_pwd: c0ntrail123
43 roles:
44 openstack_compute:
45 vrouter:
46 PHYSICAL_INTERFACE: bond0.101
47 CPU_CORE_MASK: "0x1fe"
48 DPDK_UIO_DRIVER: uio_pci_generic
49 HUGE_PAGES: 32000
50 AGENT_MODE: dpdk
51 a7s5:
52 provider: bms
53 ip: 10.84.27.5
54 ssh_user: root
55 ssh_pwd: c0ntrail123
56 roles:
57 openstack_compute:
58 vrouter:
59 PHYSICAL_INTERFACE: bond0.101
60 contrail_configuration:
61 CONTRAIL_VERSION: 2008.108
62 OPENSTACK_VERSION: rocky
63 CLOUD_ORCHESTRATOR: openstack
64 CONTROLLER_NODES: 8.0.0.1
65 OPENSTACK_NODES: 8.0.0.1
66 CONTROL_NODES: 8.0.0.1
67 KEYSTONE_AUTH_HOST: 8.0.0.200
68 KEYSTONE_AUTH_ADMIN_PASSWORD: c0ntrail123
69 RABBITMQ_NODE_PORT: 5673
70 KEYSTONE_AUTH_URL_VERSION: /v3
71 IPFABRIC_SERVICE_IP: 8.0.0.200
72 VROUTER_GATEWAY: 8.0.0.254
73 two_interface: true
74 ENCAP_PRIORITY: VXLAN,MPLSoUDP,MPLSoGRE
75 AUTH_MODE: keystone
76 CONFIG_API_VIP: 10.84.27.51
77 ssh_user: root
78 ssh_pwd: c0ntrail123
79 METADATA_PROXY_SECRET: c0ntrail123
80 CONFIG_NODEMGR__DEFAULTS__minimum_diskGB: 2
81 CONFIG_DATABASE_NODEMGR__DEFAULTS__minimum_diskGB: 2
82 DATABASE_NODEMGR__DEFAULTS__minimum_diskGB: 2
83 XMPP_SSL_ENABLE: no
84 LOG_LEVEL: SYS_DEBUG
85 AAA_MODE: rbac
86 kolla_config:
87 kolla_globals:
88 kolla_internal_vip_address: 8.0.0.200
89 kolla_external_vip_address: 10.84.27.51
90 contrail_api_interface_address: 8.0.0.1
91 keepalived_virtual_router_id: "111"
92 enable_haproxy: "yes"
93 enable_ironic: "no"
94 enable_swift: "no"
95 kolla_passwords:
96 keystone_admin_password: c0ntrail123
97 metadata_secret: c0ntrail123
98 keystone_admin_password: c0ntrail123
-
line 1-3: global configurations
-
line 2: specifies the registry from which to pull Contrail containers
-
line 3: set to "True" when the containers are pulled from a private, insecure registry (the one named in CONTAINER_REGISTRY)
-
line 4-9: configures provider-specific settings
-
line 5: bare metal server (bms) environment
-
line 6-9: ssh password, user name, ntpserver and domainsuffix
-
line 10-59: the instances section defines the nodes on which the containers will be launched. Here we define 4 nodes, named a7s2, a7s3, a7s4 and a7s5
-
line 11-24: this is the configuration section for node a7s2
-
line 12-14: defines this server’s provider type (baremetal server), ip address, and roles
-
line 14-24: roles of the containers that will be installed on this node. According to the configuration, server a7s2 will be installed with all "controller" software modules, for both openstack and contrail.
-
line 25-37: defines the parameters for our first DPDK compute node; openstack compute components and the contrail vRouter will be installed.
-
line 33: under vRouter, bond0.101 will be the PHYSICAL_INTERFACE, also called the "fabric interface", which carries all the underlay data packets
-
line 34-37: these are the DPDK-specific configurations; for a kernel-based vRouter they are not needed.
-
line 34: CPU_CORE_MASK defines the DPDK vRouter forwarding lcore pinning. 0x1fe, converted to binary, is 0b000111111110. That means physical CPU cores 1 through 8 are used as forwarding lcores: lcore#10 through lcore#17.
-
line 35: DPDK_UIO_DRIVER specifies which UIO driver to use. Here it is uio_pci_generic. (There is another popular UIO driver: igb_uio.)
-
line 36: HUGE_PAGES defines the number of huge pages. Here we allocate 32000 huge pages; with a 2M page size, that is 64G of memory in total. The free -h command output on the compute node will confirm this.
-
line 37: sets the agent mode to dpdk.
-
line 38-50: defines the second DPDK vRouter, on server a7s4
-
line 51-59: defines the third vRouter. This one is kernel-based, so we don’t need any DPDK-specific parameters.
-
line 60-85: the contrail_configuration section contains parameters for Contrail services
-
line 61-62: specifies contrail and openstack versions.
-
line 63: specifies the cloud orchestrator; it can be openstack or vcenter. Our setup uses openstack only.
-
line 64-66: specify the controller node. In our setup, both the openstack and contrail controllers are installed on the same node.
-
line 71, 76: these are the two "virtual IPs" configured
-
line 80-82: these are needed only for lab setups. Without these parameters, contrail-status will print a warning indicating that the storage space is too small.
-
line 86-98: defines the parameters for Kolla
-
line 87-94: refers to the OpenStack services
-
line 88-89: VIPs configured for the management and data/ctrl networks respectively. One usage of these VIPs is to make it possible to access the openstack horizon service (webUI) from the management network. By default, all OpenStack services listen on the IP in the data/ctrl network. With these VIPs configured and used by keepalived, HAProxy can forward access requests coming from the management network to the Horizon service.
installation steps
Once the yaml file is carefully prepared, the installation process is
relatively easy. Basically, we just need to install some prerequisite software
packages, such as python libraries, git and the ansible tools. git is required to
clone a github repository which includes all the ansible playbooks. Then we use
ansible to automate the installation on all nodes, based on the playbooks and
our configuration file instances.yaml. The detailed steps are:
-
install pre-requisite packages on a7s2
yum -y remove python-netaddr
yum -y install epel-release python-pip gcc python-cffi python-devel bcrypt==3.1.7 sshpass python-wheel
pip install wheel requests
yum -y install git
pip install ansible==2.5.2.0
-
install ansible deployer
git clone http://github.com/tungstenfabric/tf-ansible-deployer
cd tf-ansible-deployer
-
place the instances.yaml configuration file in tf-ansible-deployer/config
-
install contrail
ansible-playbook -i inventory/ -e orchestrator=openstack playbooks/configure_instances.yml
ansible-playbook -i inventory/ playbooks/install_openstack.yml
ansible-playbook -i inventory/ -e orchestrator=openstack playbooks/install_contrail.yml
-
install openstack client
pip install --ignore-installed python-openstackclient python-ironicclient openstack-heat
Once everything succeeds, you will have an up-and-running 4-node contrail cluster (1 controller node and 3 vRouter/compute nodes). You can log in to the setup through the webUI or an ssh session to check the system’s running status.
post-installation verification
Here is the contrail web UI for a working setup:
You can also log in to each individual node with ssh and run the contrail-status
command to verify the running status of each of its components.
If everything works, congratulations! You now have your own lab to play with. Next, we’ll go over the steps of setting up the testing tools to send and receive traffic: PROX and the rapid scripts.
dpdk vRouter test tools: prox and rapid
introduction
PROX (Packet pROcessing eXecution Engine) is an OPNFV project application built on top of DPDK. It is capable of performing various operations on packets in a highly configurable manner. It also supports performance statistics that can be used for performance investigations. Because of the rich feature set it supports, it can be used to create flexible software architectures through small and readable configuration files. In this chapter we’ll introduce how to use it to test vRouter performance in a DPDK environment.
In a typical test you need two VMs running PROX. VM1 generates packets and sends them to VM2, which performs a "swap" operation on all packets so that they are sent back to VM1.
-
"traffic generator" VM ("gen" VM)
-
"traffic receiver and looping VM" VM ("swap" VM, or "loop" VM)
In this book we will call them the "gen" and "swap" VMs respectively. One special feature used here is that the "swap" PROX is configured in such a way that, once it receives the packets sent from the generator, it will "swap", or "loop", them back to the generator VM, so the latter can collect them and calculate how much traffic was forwarded by the DUT, which in our case is the DPDK vRouter.
Rapid (Rapid Automated Performance Indication for Dataplane) is a group of "wrapper" scripts interacting with PROX to simplify and automate its configuration. It is a set of files and scripts offering an even easier way to do a sanity check of dataplane performance.
rapid is very powerful and configurable. A typical workflow is as follows:
-
A script named runrapid.py sends the proper configuration files to the gen and swap VMs involved in the test, so each one knows its role ("generator" or "swapper").
It then starts PROX within both VMs, as generator and swapper respectively.
-
While the test is ongoing it collects the results from PROX. Results are printed on the screen and logged in the log and csv files.
-
The same tests will be done for different packet sizes and/or different numbers of flows.
The rapid scripts are typically installed in a third VM, called the "jump" VM in this book. The purpose of this VM is to control the traffic generator (start, stop and pause the test) as well as to collect the statistics.
A typical prox and rapid testing setup looks like this:
The test setup consists of three compute nodes, running the above mentioned 3 VMs respectively:
-
"PROX generate VM" runs on compute-A: This is the "traffic generator" VM for traffic generation
-
"PROX looping VM" runs on compute-B: This is the "swap" VM for looping traffic out of the same interface where it came in. This is the DUT (device under test) where the vRouter is running.
-
"rapid jump VM" runs on compute-C: This is the VM where rapid scripts are installed, it is responsible for control traffic genaration and collecting results
Here is a brief summary of the hardware requirements for the different VMs:
-
swap VM: this is where the DUT (vRouter) is located. Based on the test requirements, a specific amount of hardware resources should be allocated, and all applications that could unnecessarily consume hardware resources should be removed.
-
gen VM: in order to saturate the DUT, the traffic generator VM and its compute should be allocated much more CPU resources than the DUT.
-
jump VM: no high-speed VM is required; it can run on a kernel or DPDK compute.
-
Optionally, the generator and receiver computes can run on a bonded interface configured in 802.3ad LACP mode. This is a common configuration recommended in practical environments.
Note: By default, multi-queue is enabled on both the PROX gen and swap VMs via
an openstack flavor. You can refer to chapter 3 for more details about the
"multi-queue" feature and its configuration. Additionally, the rapid scripts also
provide CPU pinning to protect the PROX PMDs against CPU stealing by other
processes and by the VM operating system.
installation: manual steps
As mentioned earlier, to perform the test we need two VMs, both running PROX: one sends traffic, and the other one receives it and swaps it back. The exact same PROX application runs on both, just with different configuration files.
Obviously, IP-level connectivity is required for the two VMs to exchange packets with each other. In our case, the two VMs will be spawned by openstack nova. Needless to say, all the supporting objects and resources associated with the VMs, like IPAM, subnet, virtual network and VM flavor (size of CPU/memory/storage/etc.), also need to be created in the openstack infrastructure, either from the horizon webUI or the openstack CLIs. A quick list of the common tasks:
-
create IPAMs/subnets/virtual networks
-
create flavors
-
create images
-
create host aggregates
-
create instances
-
create key-pairs
On top of these, installing PROX inside the VMs, as with many other open source projects, often requires downloading the source code and compiling it on your platform. That means you download the PROX source code, compile it to get the executable, then configure and run the application. In this section we’ll introduce how PROX was installed in the setup we built for this book. You can find more details on the PROX website: https://wiki.opnfv.org/display/SAM/PROX+installation
The software and CPU model we use here are shown below:
[root@a7s3 ~]# cat /etc/centos-release
CentOS Linux release 7.7.1908 (Core)

[root@a7s3 ~]# uname -a
Linux a7s3 3.10.0-1062.el7.x86_64 #1 SMP Wed Aug 7 18:08:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

[root@a7s3 ~]# lscpu | grep Model
Model:               62
Model name:          Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
In our lab setup the VM OS is the same as the host, and the emulated CPU Model
is Intel Xeon E3-12xx:
[root@stack2-gen ~]# cat /etc/centos-release
CentOS Linux release 7.7.1908 (Core)

[root@stack2-gen ~]# uname -a
Linux stack2-gen.novalocal 3.10.0-1062.18.1.el7.x86_64 #1 SMP Tue Mar 17 23:49:17 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

[root@stack2-gen ~]# lscpu | grep -i Model
Model:               58
Model name:          Intel Xeon E3-12xx v2 (Ivy Bridge, IBRS)
Note: There is a good chance that your servers and VMs have totally different hardware and software architectures. The steps below are tested and work fine in our setup, but depending on your environment they may work just as well or run into some errors. Check the PROX online documentation for more detailed instructions.
PROX is a DPDK application: when running, it relies on the DPDK libraries to implement most of its features. Therefore, to build it we need a DPDK environment.
Tip: You can either build it inside the VM where you want to run it, or build it directly in the host environment where the VM is spawned and then copy it into the VM.
The steps to build DPDK in our setup are as below. Add the following to the end of the ~/.bashrc file:
sudo yum install numactl-devel net-tools wget gcc unzip libpcap-devel \
ncurses-devel libedit-devel pciutils lua-devel kernel-devel
export RTE_SDK=/root/dpdk
export RTE_TARGET=x86_64-native-linuxapp-gcc
export RTE_KERNELDIR=/lib/modules/`ls /lib/modules`/build
export RTE_UNBIND=$RTE_SDK/tools/dpdk_nic_bind.py
#Re-login or source that file
. ~/.bashrc
#Build DPDK
git clone https://github.com/DPDK/dpdk
cd dpdk
git checkout v19.11
make install T=$RTE_TARGET
Now, with the DPDK libraries built, we can download and build the PROX application. Here are the steps:
git clone https://github.com/opnfv/samplevnf
cd samplevnf/VNFs/DPPD-PROX
git checkout origin/master
make
When make succeeds, the compiled prox binary will be available in the build folder of the current directory. We'll demonstrate this later. A set of sample configuration files can be found in the ./config folder, and sample configs for PROX functioning as the "generator" are available in the ./gen folder.
Assuming the current directory is where you've just built PROX, we can launch PROX with a proper configuration file:
./build/prox -f <prox configuration file>
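For orientation, below is a heavily abridged sketch of what a generator configuration file can look like. The section and key names follow the PROX sample configs, but treat it as illustrative only and start from the samples shipped in the ./config and ./gen folders:
[eal options]
-n=4

[port 0]
name=p0
mac=hardware

[global]
name=gen example

[core 0]
mode=master

[core 1]
name=gen
task=0
mode=gen
tx port=p0
bps=1250000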
When it runs, an ncurses-based UI pops up, through which you will see the running state updated in real time. We'll give an example of this later.
The rapid scripts can be downloaded from here: https://github.com/opnfv/samplevnf/tree/master/VNFs/DPPD-PROX/helper-scripts/rapid. The scripts are written in Python, so you can run them directly; there is no need to compile anything.
installation: heat automation
We have just introduced the steps to manually compile PROX from source code. We have also assumed you know how to perform a list of tasks to create all the necessary objects required by the VMs in openstack. Doing this once is not a big deal. But suppose you are working in a dynamic environment where you often need to:
-
quickly build up a PROX test environment to do some tests
-
tear it down after the test is finished
-
redo the same test all over again in another cluster
Repeating these manual steps quickly becomes tedious and even painful. You will soon want to simplify the building, creation and configuration of PROX, as well as the creation of all the necessary openstack resources. In an openstack environment the number one choice for automation is heat. With heat, all tasks are typically programmed in a template file, which reads all its parameters from a separate environment file. In the appendix, we provide sample template and environment files and the associated scripts, which are tested and proven to work in our setup. You can use them as a starting point, then make the necessary customizations for your environment to build your own automation. The virtual machine where the tools run, including the rapid scripts and a pre-compiled PROX DPDK application, has also been built as an image. With all these automations carefully designed and tested, what we need to do now becomes much simpler:
-
download this pre-built image and load it into openstack image service
-
create the heat stack with the sample template files
If everything goes well, you will have your whole PROX testing environment available in just a few minutes. The detailed steps are listed below:
-
Prepare the pre-built VM image, heat template files and scripts:
-
VM image: this is the image with PROX compiled in, as shown in the previous section.
-
heat template: see appendix
-
Load the rapid image into the openstack glance service:
openstack image create --disk-format qcow2 --container-format bare --public --file rapidVM.qcow2 rapidVM-1908
openstack image set --property hw_vif_multiqueue_enabled="true" rapidVM-1908
-
(Optional) if you're using a ceph backend, convert the image to raw format first:
qemu-img convert rapidVM-1908.qcow2 rapidVM-1908.raw
openstack image create --disk-format raw --container-format bare --public --file rapidVM.raw rapidVM-1908
openstack image set --property hw_vif_multiqueue_enabled="true" rapidVM-1908
-
Adjust the heat template files based on your environment (a minimal sketch follows this list):
-
environment.yaml
-
build-rapid.yml
-
configure.rapid.sh
-
Create the heat stack:
openstack stack create -t build-rapid.yml -e environment.yaml stack2
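For illustration, here is a minimal sketch of what such a heat template might look like. The parameter and resource names here are assumptions for illustration only; the complete, tested files are in the appendix.
heat_template_version: 2015-04-30
description: Minimal illustrative sketch of a rapid/PROX stack (not the full appendix template)
parameters:
  image:
    type: string
    default: rapidVM-1908
  flavor:
    type: string
  network:
    type: string
    default: stack2-control
resources:
  gen_vm:
    type: OS::Nova::Server
    properties:
      image: { get_param: image }
      flavor: { get_param: flavor }
      networks:
        - network: { get_param: network }
outputs:
  gen_ip:
    description: control-network IP of the generator VM
    value: { get_attr: [gen_vm, first_address] }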
Wait a few minutes and use the openstack stack list command to check the stack creation status. Once it has succeeded, you can use the different sub-commands of the openstack stack command to retrieve the parameters of the stack components:
openstack stack list
openstack stack resource list STACK
openstack stack resource list --filter type=OS::Nova::Server STACK
openstack stack show STACK
openstack stack output show STACK --all
The image has been configured with a root login password of c0ntrail123, so all 3 VMs, once up and running, inherit the same login credentials. In a contrail/openstack integration environment there are a few common ways to access a VM running on a specific compute node:
-
floating IP: This is a routable IP address, visible from outside of the cluster, which maps to an internal IP of the VM. Once the VM is launched, you can log in to it with this IP address from anywhere that can reach the IP.
-
virsh console: virsh provides access to the VM console. This does not require any IP address to be configured.
-
meta_ip_address: This is a non-routable private IP that is visible only from a specific compute. This IP address is automatically generated and mapped to the VM's tap interface IP.
In our test we didn't configure any floating IP, so we will use the console and the meta_ip_address to access the VMs. To access a VM console, use the virsh console command from the nova_libvirt docker on the compute node:
[root@a7s3 ~]# docker exec -it nova_libvirt virsh list
 Id    Name                           State
----------------------------------------------------
 2     instance-00000041              running

[root@a7s3 ~]# docker exec -it nova_libvirt virsh console 2
Connected to domain instance-00000041
Escape character is ^]

CentOS Linux 7 (Core)
Kernel 3.10.0-1062.18.1.el7.x86_64 on an x86_64

stack2-gen login: root
Password:
Last login: Fri Sep 25 17:31:21 from 192.168.0.2
[root@stack2-gen ~]#
Compared with the console, an ssh session is usually preferred. Let's take a look at each VM's allocated interface IPs with the openstack server list command:
Let's take our "jump" VM stack2-jump as an example. Openstack allocated the IP address 192.168.0.106 to its tap interface from the stack2-control virtual network. However, this IP address is not directly reachable from the host. In order to ssh into the VM, we need to first locate the meta_ip_address allocated to the VM's tap interface, or more specifically, to the corresponding vif interface in vRouter. We can use the vRouter vif command to confirm which vif interface has this IP:
[root@a7s5-kiran ~]# contrail-tools vif -l | grep -B2 -A6 192.168.0.106
vif0/3 OS: tap0160123b-14 NH: 28
Type:Virtual HWaddr:00:00:5e:00:01:00 IPaddr:192.168.0.106
Vrf:2 Mcast Vrf:2 Flags:PL3L2DEr QOS:-1 Ref:6
RX packets:47246 bytes:2362255 errors:0
TX packets:42996 bytes:2133684 errors:0
ISID: 0 Bmac: 02:01:60:12:3b:14
Drops:3553
Good. vif0/3 has the IP, so this vif connects to the tap interface of our jump VM. In contrail vRouter, for each vif there is also a "hidden" meta_data_ip of 169.254.0.N, where N is the same number as in vif0/N. Therefore in this case our meta_data_ip is 169.254.0.3. Let's try to start an ssh session to it:
[root@a7s5-kiran ~]# ssh 169.254.0.3
Password:
Last login: Wed Sep 23 11:13:58 2020
[root@stack2-jump ~]#
It works. The benefit of this approach is not only that the interaction with the VM is much faster, but also that it supports file copies with the scp tool. Remember that in many cases the VM does not have any Internet connection, so whenever you need to copy files into (or out of) the VM, the meta_data_ip method is especially useful.
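For example, copying a local file from the compute into the jump VM over the meta_data_ip found above (the file name is illustrative):
[root@a7s5-kiran ~]# scp myfile 169.254.0.3:/root/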
run rapid automation: runrapid.py
With the stack created and all VMs up and running, we can now introduce how to run a test with rapid. Remember rapid is installed in the "jump" VM, so we'll need to execute the script from there.
On the jump VM, go to the /root/prox/helper-scripts/rapid/ folder, where you can locate a python script named runrapid.py. To run a test you can just execute it without any parameters:
cd /root/prox/helper-scripts/rapid/
./runrapid.py
This will start the rapid script and send traffic for 10 seconds by default. The duration of the traffic can be adjusted with the --runtime option:
cd /root/prox/helper-scripts/rapid/
./runrapid.py --runtime <time>   # replace <time> with the duration of one run, in seconds
A few other command line options are supported, which can be listed by -h:
[root@stack2-jump rapid]# ./runrapid.py -h
usage: runrapid [--version] [-v]
[--env ENVIRONMENT_NAME]
[--test TEST_NAME]
[--map MACHINE_MAP_FILE]
[--runtime TIME_FOR_TEST]
[--configonly False|True]
[--log DEBUG|INFO|WARNING|ERROR|CRITICAL]
[-h] [--help]
Command-line interface to runrapid
optional arguments:
  -v, --version           Show program's version number and exit
  --env ENVIRONMENT_NAME  Parameters will be read from ENVIRONMENT_NAME. Default is rapid.env.
  --test TEST_NAME        Test cases will be read from TEST_NAME. Default is basicrapid.test.
  --map MACHINE_MAP_FILE  Machine mapping will be read from MACHINE_MAP_FILE. Default is machine.map.
  --runtime               Specify time in seconds for 1 test run
  --configonly            If this option is specified, only upload all config files to the VMs, do not run the tests
  --log                   Specify logging level for log file output, default is DEBUG
  --screenlog             Specify logging level for screen output, default is INFO
  -h, --help              Show help message and exit.
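For example, a one-minute run with more verbose screen logging can be started by combining a few of these options (the file names are the defaults listed in the help output):
./runrapid.py --env rapid.env --test basicrapid.test --map machine.map --runtime 60 --screenlog DEBUG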
A typical runrapid.py script execution looks like this:
You can see that some preparation work is done before the actual test starts:
-
First, the script reads 3 files: rapid.env, basicrapid.test and machine.map. The env file provides the IP/MAC information of the gen and swap VMs, and the .test file defines all the detailed behavior of the test.
-
Then, the script connects to both the gen and swap VMs.
-
The script starts a small amount of traffic as a "warmup". This tests the reachability between the source and destination, and also populates the MAC or ARP tables in the devices along the path.
-
When everything is ready, the script starts the traffic at a certain speed and at the same time monitors the traffic receive rate in real time. A packet drop rate higher than the defined threshold indicates that the current traffic rate is too high for the DUT, so the script lowers the rate in the next iteration. By binary search, it eventually finds the maximum throughput between the two systems within the allowed packet loss and accuracy defined in the *.test files (e.g. the basicrapid.test file for a simple test).
The script is highly configurable. In the appendix we provide the sample basicrapid.test that we use in our lab. You can start with it and fine-tune it based on your needs. For example, in section [test2] of the file you can change the number of flows and the packet sizes to define different test scenarios:
[test2]
test=flowsizetest
packetsizes=[64,256,512,1024,1500]
# the number of flows in the list need to be powers of 2, max 2^20
# Select from following numbers: 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65535, 131072, 262144, 524280, 1048576
flows=[16384, 65535]
run PROX manually
OK, we just introduced rapid. The script supports very extensive options in its configuration files, which are beyond the scope of this book, but we've got a basic idea of how it works. Please remember that rapid and PROX are two different applications: the rapid script does all the magic and makes your life easier by automating PROX, while PROX is the underlying application that does the "real" work. In fact, PROX can run tests just fine without rapid. To launch PROX and start traffic, in the "gen" VM's home folder (root in our case) start this command:
[root@stack2-gen ~]# /root/prox/build/prox -f /root/gen.cfg
PROX will parse its configuration file /root/gen.cfg and start to boot. From the booting messages on the screen we can learn its booting sequence:
-
setting up the DPDK environment (RTE EAL)
-
initializing (rte) devices
-
initializing mempools, port addresses, queue numbers and rings on cores
-
initializing DPDK ports
-
initializing tasks
-
starting the test and displaying an ncurses-based text UI
You will end up with an ncurses-based UI like the one below:
The display shows per-task statistics, which include estimated idleness; per-second statistics for packets received, transmitted or dropped; per-core cache occupancy; cycles per packet; etc. These statistics can help pinpoint bottlenecks in the system, and this information can then be used to optimize the configuration. There are quite a few other features, including debugging support, scripting, Open vSwitch support, etc. Refer to the PROX website for more details.
For now, let's look at how the traffic flows. From the screenshot above we only see traffic being sent, but nothing received yet. The reason is that we are running PROX manually and we have only started the "gen" side, which is the traffic "sender". We need to start the "swap" VM as well, as a "receiver" that also "loops" the traffic back to the sender, so that our first PROX application will see some "RX" statistics. Let's do that. On the "swap" VM, execute the same prox command line, except this time we pass a different configuration file named swap.cfg:
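Assuming the same file layout as on the gen VM (the swap VM's hostname shown here is illustrative), the command is:
[root@stack2-swap ~]# /root/prox/build/prox -f /root/swap.cfg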
Here you will end up with a similar ncurses-based text UI, after a similar booting process to the sender's. Once our "swap" end of PROX is up and running, you will immediately see both "RX" and "TX" counters keep updating on both sides of the traffic:
That concludes our discussion of PROX and rapid as our testing tools. We'll use these tools intensively in the rest of this chapter to generate different kinds of traffic in each test. With the traffic running, we can dig deeper to understand the rules we've introduced about how vRouter works. Next we'll introduce some of the commonly used tools that are designed for, or especially useful for, verification in a DPDK vRouter environment.
dpdk vRouter tool box
In this book you've read a lot of details about DPDK and the contrail DPDK vRouter implementation, and you should understand that the performance boost is the main benefit it brings. As with almost everything, it has both pros and cons. One problem that is commonly raised is the lack of tools during the troubleshooting process, especially in the case of traffic loss problems. In the traditional linux world, there are tons of well-known tools to trace packets, from displaying packet statistics in and out of a NIC and showing drop counters, to performing packet captures for deeper packet decoding. Examples of these tools are ifconfig, ip, bmon, tcpdump, tshark, etc. With DPDK, however, none of the traditional tools can be used directly, and the reason is obvious: whichever interface is bound to DPDK becomes invisible to the linux stack, and hence is also hidden from the perspective of the tools relying on it. In production, we need some new tools to fill this gap, so that we can narrow down packet loss issues while an outage is ongoing. Fortunately, today's contrail DPDK vRouter is equipped with quite a few such tools. In this section we'll look at some of them.
"contrail-tools" docker: vRouter tools box
"contrail-tools" is a docker container located in the compute node, where all of the vRouter tools and utilities are available. Apparently, from the user perspective, this is more convenient than distributing tools into multiple containers. This design was introduced a few releases before contrail networking R2008. As more and more existing tools migrated into it and new tools added in, this container now really becomes a centralized "tool box", which you’d like to open whenever you want to check any running states of the vRouter dataplane. Let’s first take a look at how to "open" this "box".
To enter the container, just run the contrail-tools script (same name as the docker) on a compute node:
[root@a7s3 ~]# contrail-tools
Unable to find image 'svl-artifactory.juniper.net/contrail-nightly/contrail-tools:2008.108' locally
2008.108: Pulling from contrail-nightly/contrail-tools
f34b00c7da20: Already exists
b3779b5a313a: Already exists
4b95f42cde64: Already exists
8b329f8ee1e6: Already exists
2986115b3d27: Already exists
10c5940c4895: Already exists
dec794e181cd: Already exists
226c056c5788: Already exists
d391962e0038: Pull complete
Digest: sha256:2d68d8cd010ba76c265c3b7458fcf12c459d46ec71357b45118dfc4610f40338
Status: Downloaded newer image for svl-artifactory.juniper.net/contrail-nightly/contrail-tools:2008.108
(contrail-tools)[root@a7s3 /]$
Now you are inside the container. From here you can use all of the old vRouter tools you may already be familiar with, for example to print the packet drop statistics:
(contrail-tools)[root@a7s3 /]$ dropstats | grep -iEv " 0$|^$"
Flow Action Drop 1792
Flow Queue Limit Exceeded 305
Invalid NH 12
No L2 Route 1
Tip: we use grep to remove all counters with a zero value.
When you are done, just exit the docker and it will be killed.
(contrail-tools)[root@a7s3 /]$ exit
exit
[root@a7s3 ~]#
You can also pass the tool command as a parameter to the script: it will execute the command, print the output and exit the docker, all in one go.
[root@a7s3 ~]# contrail-tools dropstats | grep -iE route
No L2 Route          68129939
[root@a7s3 ~]#
As of the writing of this book, there are nearly 20 tools available in this container. Let's take a look at what's in the package.
First, in the container we’ll locate the package name:
[root@a7s3 ~]# contrail-tools
(contrail-tools)[root@a7s3 /]$ rpm -qa | grep contrail-tool
contrail-tools-2008-108.el7.x86_64
Then, based on the package name, we can list all available tools in it:
(contrail-tools)[root@a7s3 /]$ repoquery -l contrail-tools-2008-108.el7.x86_64 | grep bin
/usr/bin/dpdkinfo
/usr/bin/dpdkvifstats.py
/usr/bin/dropstats
/usr/bin/flow
/usr/bin/mirror
/usr/bin/mpls
/usr/bin/nh
/usr/bin/pkt_droplog.py
/usr/bin/qosmap
/usr/bin/rt
/usr/bin/sandump
/usr/bin/vif
/usr/bin/vifdump
/usr/bin/vrfstats
/usr/bin/vrftable
/usr/bin/vrinfo
/usr/bin/vrmemstats
/usr/bin/vrouter
/usr/bin/vxlan
In previous chapters you've read about the dpdk_nic_bind.py script, which is a tool to bind a specific driver to a NIC. In the rest of this section, we'll introduce some more tools that are especially useful in a DPDK environment.
vif command and scripts
The first one from our contrail DPDK "tool box" is the vif command. Before talking about it, let's see how we would list all the interfaces in a compute running DPDK vRouter. Let's first try the linux ip or ifconfig command on our DPDK compute running the PROX gen VM:
[root@a7s3 ~]# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 0c:c4:7a:4c:16:c2 brd ff:ff:ff:ff:ff:ff
3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 0c:c4:7a:4c:16:c3 brd ff:ff:ff:ff:ff:ff
8: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default
link/ether 02:42:56:4f:cc:6e brd ff:ff:ff:ff:ff:ff
25: vhost0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 1000
link/ether 90:e2:ba:c3:af:20 brd ff:ff:ff:ff:ff:ff
Well, we do see some interfaces printed:
-
the loopback interface (lo)
-
management interface (eno1)
-
vhost0 interface
-
docker interface (docker0)
-
physical NIC which is not in use (eno2)
However, the most important interfaces are not shown at all:
-
The physical fabric interface: the "bond" interface in our setup
-
The VM virtual interfaces: the "tapxxx" interfaces
If we compare this with what the same ip command shows on a kernel-mode vRouter compute (without DPDK), we see a big difference:
[root@a7s5-kiran ~]# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 0c:c4:7a:47:d7:b4 brd ff:ff:ff:ff:ff:ff
3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 0c:c4:7a:47:d7:b5 brd ff:ff:ff:ff:ff:ff
4: enp2s0f0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000
link/ether 00:1b:21:bb:f9:46 brd ff:ff:ff:ff:ff:ff
5: enp2s0f1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000
link/ether 00:1b:21:bb:f9:46 brd ff:ff:ff:ff:ff:ff
6: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 00:1b:21:bb:f9:46 brd ff:ff:ff:ff:ff:ff
12: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default
link/ether 02:42:d6:c6:2c:12 brd ff:ff:ff:ff:ff:ff
41: pkt1: <UP,LOWER_UP> mtu 65535 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/void c2:6e:97:ef:cd:b2 brd 00:00:00:00:00:00
42: pkt3: <UP,LOWER_UP> mtu 65535 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/void 8e:44:4e:2e:28:0c brd 00:00:00:00:00:00
43: pkt2: <UP,LOWER_UP> mtu 65535 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/void a6:2a:01:7c:db:65 brd 00:00:00:00:00:00
44: vhost0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 1000
link/ether 00:1b:21:bb:f9:46 brd ff:ff:ff:ff:ff:ff
45: bond0.101@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 00:1b:21:bb:f9:46 brd ff:ff:ff:ff:ff:ff
46: pkt0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 1000
link/ether 5e:a0:f8:77:25:97 brd ff:ff:ff:ff:ff:ff
49: tap0160123b-14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether fe:01:60:12:3b:14 brd ff:ff:ff:ff:ff:ff
Here, besides lo, the management interface and whatever else we saw on the DPDK compute, we also see all of these other important interfaces:
-
bond interface and its subinterface: bond0, bond0.101
-
bond interface's member interfaces: enp2s0f0, enp2s0f1
-
VM tap interface: tap0160123b-14
-
pkt0 interface
Tip: the pkt1, pkt2 and pkt3 interfaces are created by vRouter but are not used in a dpdk setup.
The reason we see these differences, as we've mentioned many times throughout this book, is that when DPDK is in charge of the NIC, the linux kernel is mostly "bypassed". The NIC's features and functions are exposed by a special driver directly to the user-space PMD driver running in the DPDK layer, so the traditional applications, which rely on interfaces sitting in the linux kernel to do their job, are no longer useful.
We'll talk more about this later. For now, let's look at the vif command with the -l|--list and --get options: vif --list lists all interfaces located in the vRouter, and --get just retrieves one of them. Here is a capture from the same DPDK compute:
[root@a7s3 ~]# contrail-tools vif --get 3
Vrouter Interface Table
Flags: P=Policy, X=Cross Connect, S=Service Chain, Mr=Receive Mirror
Mt=Transmit Mirror, Tc=Transmit Checksum Offload, L3=Layer 3, L2=Layer 2
D=DHCP, Vp=Vhost Physical, Pr=Promiscuous, Vnt=Native Vlan Tagged
Mnp=No MAC Proxy, Dpdk=DPDK PMD Interface, Rfl=Receive Filtering Offload, Mon=Interface is Monitored
Uuf=Unknown Unicast Flood, Vof=VLAN insert/strip offload, Df=Drop New Flows, L=MAC Learning Enabled
Proxy=MAC Requests Proxied Always, Er=Etree Root, Mn=Mirror without Vlan Tag, HbsL=HBS Left Intf
HbsR=HBS Right Intf, Ig=Igmp Trap Enabled
vif0/3 PMD: tap41a9ab05-64 NH: 32
Type:Virtual HWaddr:00:00:5e:00:01:00 IPaddr:192.168.1.104
Vrf:3 Mcast Vrf:3 Flags:PL3L2DMonEr QOS:-1 Ref:12
RX queue packets:2306654691 errors:0
RX queue errors to lcore 0 0 0 0 0 0 0 0 0 0 0 0
RX packets:2306869103 bytes:285898139558 errors:0
TX packets:47613036 bytes:5739655392 errors:0
ISID: 0 Bmac: 02:41:a9:ab:05:64
[root@a7s3 ~]# contrail-tools vif -l
Vrouter Interface Table
......
vif0/0 PCI: 0000:00:00.0 (Speed 20000, Duplex 1) NH: 4
Type:Physical HWaddr:90:e2:ba:c3:af:20 IPaddr:0.0.0.0
Vrf:0 Mcast Vrf:65535 Flags:TcL3L2VpVofEr QOS:-1 Ref:18
RX device packets:106218495224 bytes:12108991404264 errors:0
RX queue errors to lcore 0 0 0 0 0 0 0 0 0 0 0 0
Fabric Interface: eth_bond_bond0 Status: UP Driver: net_bonding
Slave Interface(0): 0000:02:00.0 Status: UP Driver: net_ixgbe
Slave Interface(1): 0000:02:00.1 Status: UP Driver: net_ixgbe
Vlan Id: 101 VLAN fwd Interface: vfw
RX packets:53109240518 bytes:5842056828972 errors:0
TX packets:53459418469 bytes:5880886194306 errors:0
Drops:291
TX device packets:106919210258 bytes:12189494593618 errors:0
vif0/1 PMD: vhost0 NH: 5
Type:Host HWaddr:90:e2:ba:c3:af:20 IPaddr:8.0.0.4
Vrf:0 Mcast Vrf:65535 Flags:L3DEr QOS:-1 Ref:13
RX device packets:436036 bytes:400358720 errors:0
RX queue errors to lcore 0 0 0 0 0 0 0 0 0 0 0 0
RX packets:436036 bytes:400358720 errors:0
TX packets:447092 bytes:88525732 errors:0
Drops:3
TX device packets:447092 bytes:88518904 errors:0
vif0/2 Socket: unix
Type:Agent HWaddr:00:00:5e:00:01:00 IPaddr:0.0.0.0
Vrf:65535 Mcast Vrf:65535 Flags:L3Er QOS:-1 Ref:3
RX port packets:71548 errors:0
RX queue errors to lcore 0 0 0 0 0 0 0 0 0 0 0 0
RX packets:71548 bytes:6153128 errors:0
TX packets:14936 bytes:1359697 errors:0
Drops:0
vif0/3 PMD: tap41a9ab05-64 NH: 38
Type:Virtual HWaddr:00:00:5e:00:01:00 IPaddr:192.168.1.104
Vrf:2 Mcast Vrf:2 Flags:L3L2DEr QOS:-1 Ref:12
RX queue packets:17708866065 errors:3874701360
RX queue errors to lcore 0 0 0 0 0 0 0 0 0 0 3874691664 9696
RX packets:17708865121 bytes:1062531327800 errors:0
TX packets:17563478684 bytes:1053808124972 errors:0
ISID: 0 Bmac: 02:41:a9:ab:05:64
Drops:3874701393
vif0/4 PMD: tapd2d7bb67-c1 NH: 35
Type:Virtual HWaddr:00:00:5e:00:01:00 IPaddr:192.168.0.104
Vrf:3 Mcast Vrf:3 Flags:PL3L2DEr QOS:-1 Ref:12
RX queue packets:3060 errors:205
RX queue errors to lcore 0 0 0 0 0 0 0 0 0 0 205 0
RX packets:5478 bytes:528770 errors:0
TX packets:5402 bytes:423320 errors:0
Drops:445
Here the vRouter interfaces are:
-
vif0/0: this connects to the bond interface
-
vif0/1: this connects to vhost0, the interface in the linux kernel
-
vif0/2: this connects to the pkt0 interface toward the vrouter agent
-
vif0/3: this is the vRouter interface connecting the data interface of our PROX VM: tap41a9ab05-64
-
vif0/4: this is the vRouter interface connecting the control and management interface of our PROX VM: tapd2d7bb67-c1
Now you should understand the importance of the vif command, especially with DPDK vRouter. It shows the interfaces from vRouter's perspective, and reveals the one-to-one mapping between each vRouter interface and the fabric or VM tap interface; the latter would otherwise be "invisible".
Besides that, it also prints other important information. The Vrf numbers and packet counters are the most commonly used data points. Among the various counters, we usually focus on the RX/TX packets/bytes counters, which display the data received or sent in packets and bytes. Depending on your environment, sometimes you may also see non-zero numbers in the RX/TX queue packets/errors counters, which give inter-lcore packet statistics; this usually happens when two lcores are involved in the packet forwarding path. We'll use this command intensively in the rest of this chapter, and we'll analyze these counters and use them to understand some important vRouter working mechanisms.
The vif tool also supports some other options; use --help to print a brief list of all currently supported ones:
[root@a7s3 ~]# contrail-tools vif --help
Usage: vif [--create <intf_name> --mac <mac>]
[--add <intf_name> --mac <mac> --vrf <vrf>
--type [vhost|agent|physical|virtual|monitoring]
--transport [eth|pmd|virtual|socket]
--xconnect <physical interface name>
--policy, --vhost-phys, --dhcp-enable]
--vif <vif ID> --id <intf_id> --pmd --pci]
[--delete <intf_id>|<intf_name>]
[--get <intf_id>][--kernel][--core <core number>][--rate] [--get-drop-stats]
[--set <intf_id> --vlan <vlan_id> --vrf <vrf_id>]
[--list][--core <core number>][--rate]
[--sock-dir <sock dir>]
[--clear][--id <intf_id>][--core <core_number>]
[--help]
We won't talk about each and every option and its usage; usually you don't need to know anything beyond --get and -l|--list. There is one more (--add) that we'll talk about shortly. For the others you can refer to https://www.juniper.net/documentation/en_US/contrail20/topics/task/configuration/vrouter-cli-utilities-vnc.html for more details.
Next, let's look at two useful scripts built on top of the vif command: dpdkvifstats.py and vifdump.
dpdkvifstats.py script
We've seen that the vif command prints all interfaces and their traffic statistics (RX/TX packets/bytes/errors, RX queue packets/errors, etc.) in the form of a "list". During testing or troubleshooting, we can collect these data to evaluate the vRouter forwarding performance and running status: is it losing packets or not, and so on. In production, we always need to examine the traffic passing through a compute. The same applies in the lab: once you start traffic from PROX or any other traffic generator, the first thing you want to check is the traffic rate on the interfaces. In fact there are at least two common tasks in practice:
-
monitor the traffic forwarding "rate" (instead of only number of packets)
-
compare statistics between different vif interfaces
Starting from R2008, a python script named dpdkvifstats.py is provided, which collects the statistics from the vif output, calculates the changing rate of all counters in pps and bps, then prints the result in a table format. This makes the output look much "prettier", and also makes comparison across vif interfaces much easier.
Tip: In fact the vif command also provides the --list --rate options to print traffic rates. However, it lacks itemized per-lcore statistics, and the display is not easy to collect into a file.
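For instance, combining the --get, --core and --rate options listed in the vif help shown earlier gives a rate view of a single interface as seen by one forwarding core ("core 10" in vif numbering, as discussed later in this chapter):
contrail-tools vif --get 3 --core 10 --rate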
Let’s take a look:
To understand the output, let's first review the DPDK vRouter CPU core allocation.
In chapter 3, you learned about the DPDK vRouter architecture and how its packet processing works. Basically, vRouter creates the same number of lcores and DPDK queues as the number of CPUs allocated to it. In this compute, for testing purposes, we allocated 2 CPU cores to the vRouter dpdk forwarding lcores. Therefore, for each vRouter interface, 2 DPDK queues are created, each served by a forwarding lcore in the DPDK process. That is why the output shows 2 lines of statistics for each vif interface, for "Core 1" and "Core 2" respectively.
Note: CPU allocation to the DPDK vRouter forwarding lcores is configurable via options in the vRouter configuration files. The details of the CPU allocation implementation are beyond the scope of this book.
Now let's look at the counters. To demonstrate how the script works, in our testbed we have configured PROX to send traffic at a constant speed of 125000 bytes per second (Bps) with a minimum packet size of 64 bytes. That calculates to about 1.4K packets per second (pps).
We then run the script twice: first to show the traffic rate for vif0/3 (-v), then again to show the traffic rate for all (-a) vif interfaces for comparison. In both executions, the per-lcore statistics of a specific interface are given separately. With the -v option, the "total" value of the interface is also given, which is the sum of the counters from all cores; this gives per-interface statistics. With -a, the script additionally calculates the RX/TX/RX+TX traffic rate for each lcore across all interfaces at the end; this gives the overall forwarding load of each lcore in the DPDK vRouter.
This is very straightforward. For comparison with the vif output, let's check what the "raw" data looks like without the dpdkvifstats.py script:
[root@a7s3 ~]# date; contrail-tools vif --get 3; sleep 10; date; contrail-tools vif --get 3
Wed Oct 7 07:08:36 PDT 2020
......
vif0/3 PMD: tap41a9ab05-64 NH: 38
Type:Virtual HWaddr:00:00:5e:00:01:00 IPaddr:192.168.1.104
Vrf:3 Mcast Vrf:3 Flags:L3L2DEr QOS:-1 Ref:12
RX queue packets:1457762899 errors:0
RX queue errors to lcore 0 0 0 0 0 0 0 0 0 0 0 0
RX packets:1457893340 bytes:87471243818 errors:0
TX packets:208763 bytes:10136442 errors:0
ISID: 0 Bmac: 02:41:a9:ab:05:64
Drops:33
Wed Oct 7 07:08:47 PDT 2020
......
vif0/3 PMD: tap41a9ab05-64 NH: 38
Type:Virtual HWaddr:00:00:5e:00:01:00 IPaddr:192.168.1.104
Vrf:3 Mcast Vrf:3 Flags:L3L2DEr QOS:-1 Ref:12
RX queue packets:1457797939 errors:0
RX queue errors to lcore 0 0 0 0 0 0 0 0 0 0 0 0
RX packets:1457928405 bytes:87473347268 errors:0
TX packets:208788 bytes:10137492 errors:0
ISID: 0 Bmac: 02:41:a9:ab:05:64
Drops:33
We capture the interface data, wait 10 seconds, and capture it again. Then we can calculate the difference of each counter between the two captures, and divide each difference by 10 to get the increase "rate" of each counter:
-
pps - packets per second:
(1457928405-1457893340)/10 = 3506.5 -
Bps - bytes per second:
(87473347268-87471243818)/10 = 210345 -
bps - bit per second:
210345 * 8 = 1682760
TODO: the numbers still do not match the script result well.
To monitor multiple vif interfaces we would have to repeat these steps multiple times. Compare this manual work with having a handy script that does everything for you!
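The same arithmetic is also easy to script. Below is a minimal sketch assuming the vif output format captured above (the sed pattern simply extracts the RX packet counter):
#!/bin/bash
# take two snapshots of vif0/3's RX packet counter, 10 seconds apart,
# and print the average packet rate over that interval
rx1=$(contrail-tools vif --get 3 | sed -n 's/.*RX packets:\([0-9]*\).*/\1/p')
sleep 10
rx2=$(contrail-tools vif --get 3 | sed -n 's/.*RX packets:\([0-9]*\).*/\1/p')
echo "RX rate: $(( (rx2 - rx1) / 10 )) pps"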
The dpdkvifstats.py script is useful to quickly retrieve a snapshot of the current traffic profile, and basically that's it. When everything goes well, that is fine. In the case of traffic loss, however, we often need to first "capture" the packets themselves; then, based on the packet capture, we can decode the payload and analyze the issue. Now you may say: oh, you mean tcpdump! Well, yes and no. Please remember that we are in a setup where the NIC is invisible to most linux applications - including tcpdump! Next let's briefly go over the DPDK vRouter packet capture script: vifdump.
vifdump script
On many linux machines tcpdump comes with the OS as part of the standard packages. With it you can capture whatever packets are seen by a NIC, be it a physical NIC or a virtual NIC like a tuntap interface; both are visible to the kernel. In a DPDK environment, the fact that the interface is not visible to the kernel means tcpdump cannot work, unless you just want it to read packets from a file. Fortunately, we now know that each interface related to the vRouter dataplane connects to a unique vRouter interface (vif). We can make use of this fact and create an alternative.
vifdump is a shell script. When invoked, it uses the --add option of the vif command to create a "monitoring" tun interface in the linux kernel, and internally vRouter clones all data passing through the "monitored" vif interface to this kernel interface. vifdump then starts the tcpdump program to capture the packets from the "monitoring" tun interface. From a user's perspective, the script works the same way as tcpdump. Here are 2 captures, one on vif0/3 toward the VM (our PROX gen), and one on vif0/0 toward the fabric interface:
[root@a7s3 ~]# contrail-tools vifdump -i 3 -n -c 3
vif0/3 PMD: tap41a9ab05-64 NH: 32
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on mon3, link-type EN10MB (Ethernet), capture size 262144 bytes
13:12:31.286528 IP 192.168.1.104.filenet-cm > 192.168.1.105.filenet-nch: UDP, length 82
13:12:31.286532 IP 192.168.1.104.filenet-rmi > 192.168.1.105.filenet-pch: UDP, length 82
13:12:31.286540 IP 192.168.1.104.filenet-rpc > 192.168.1.105.filenet-pa: UDP, length 82
3 packets captured
401 packets received by filter
271 packets dropped by kernel
vifdump: deleting vif 4348...
[root@a7s3 ~]# contrail-tools vifdump -i 0 -n -c 3
vif0/0 PCI: 0000:00:00.0 (Speed 20000, Duplex 1) NH: 4
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on mon0, link-type EN10MB (Ethernet), capture size 262144 bytes
13:12:23.796516 IP 8.0.0.4.55184 > 8.0.0.2.4789: VXLAN, flags [I] (0x08), vni 8
IP 192.168.1.104.filenet-pa > 192.168.1.105.filenet-nch: UDP, length 82
13:12:23.796522 IP 8.0.0.4.54530 > 8.0.0.2.4789: VXLAN, flags [I] (0x08), vni 8
IP 192.168.1.104.filenet-rmi > 192.168.1.105.filenet-pa: UDP, length 82
13:12:23.796531 IP 8.0.0.4.63363 > 8.0.0.2.4789: VXLAN, flags [I] (0x08), vni 8
IP 192.168.1.104.filenet-nch > 192.168.1.105.filenet-pch: UDP, length 82
3 packets captured
334 packets received by filter
271 packets dropped by kernel
vifdump: deleting vif 4351...
[root@a7s3 ~]#
The shell script also uses a unix trap to catch signals, and deletes the monitoring interface when a signal arrives. The most common one is SIGINT, triggered by the ctrl-c keystroke the user presses to stop the capture. That is why we see the vifdump: deleting vif 4351... message at the end of each capture.
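In essence, the script automates steps like the following sketch. The option names are taken from the vif --help output shown earlier in this section; the exact arguments used by the real script may differ, so treat this as illustrative only:
# create a monitoring interface that clones traffic from vif0/3, capture on it, then clean up
contrail-tools vif --add mon3 --mac 00:00:00:00:00:00 --vrf 0 --type monitoring --vif 3 --id 4348
tcpdump -ni mon3
contrail-tools vif --delete 4348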
dpdkvifstats.py and vifdump are two scripts developed on top of the vif command. With these tools we can collect both the general packet RX/TX counters and the packet contents.
In the next section, we'll take a look at another powerful debugging tool that is useful in a DPDK environment: dpdkinfo.
dpdkinfo command
We've talked about the vif and dpdkvifstats.py tools. Now let's introduce a relatively new tool that can be used to investigate lower-level details of the DPDK interfaces. dpdkinfo was introduced in Contrail 20.08. Using this tool, Contrail operators can collect more information about the DPDK vRouter fabric interface's internal status, its connectivity (physical NIC bond), DPDK library information, and some other statistics.
Let's first run the tool with -h to get a brief menu:
(contrail-tools)[root@a7s3 /]$ dpdkinfo -h
Usage: dpdkinfo
--help
--version|-v Show DPDK Version
--bond|-b Show Master/Slave bond information
--lacp|-l <all/conf> Show LACP information from DPDK
--mempool|-m <all/<mempool-name>> Show Mempool information
--stats|-n <eth> Show Stats information
--xstats|-x <=all/=0(Master)/=1(Slave(0))/=2(Slave(1))>
Show Extended Stats information
--lcore|-c Show Lcore information
--app|-a Show App information
Optional: --buffsz <value> Send output buffer size (less than 1000Mb)
From this help information we can see that it provides information about the DPDK interfaces in multiple areas. In the rest of this section, let's take a look at some of the most useful options:
-
--version|-v
-
--bond|-b
-
--lacp|-l
-
--stats|-n
-
--xstats|-x
-
--lcore|-c
There are some other options, like --app|-a and --mempool|-m, that we won't introduce in this book, and the list of supported functions may grow in future releases. But you will get the basic idea of the tool's usage, and you can refer to the official documentation for the other options.
version
The -v or --version option reports the version of the DPDK release in use, along with the vRouter build information:
(contrail-tools)[root@a7s3 /]$ dpdkinfo -v
DPDK Version: DPDK 19.11.0
vRouter version: {"build-info": [{"build-time": "2020-09-04 10:38:22.330666", "build-hostname": "6fb64a1f86b9", "build-user": "root", "build-version": "2004"}]}
bond and LACP status
The -b or --bond option prints detailed information about the bond interface managed by DPDK. The output is organized in a form similar to what you would see for a bond managed by the linux kernel. Compare the output below with the cat /proc/net/bonding/bond0 output from a compute running kernel-mode vRouter:
dpdkinfo -b vs. cat /proc/net/bonding/bond0

Basically you now have the same information as for a linux kernel bond0: bonding mode, transmit hash policy, system MAC and aggregator information, etc. In this example the current bonding mode is 802.3ad dynamic link aggregation, indicating that the LACP protocol is configured between the compute and the peer device (in our environment a TOR switch). The Transmit Hash Policy shows Layer 3+4 (IP addresses + UDP ports) transmit load balancing, with which the bond allows traffic to a particular network peer to span multiple slaves. This is achieved by calculating a hash value for each packet from the IP addresses and UDP ports in the outer header of the packet, and then distributing the packet based on the hash value.
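For readers less familiar with the kernel-side view, an abridged /proc/net/bonding/bond0 from a kernel-mode compute typically looks like this (the values here are illustrative, not captured from our setup):
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer3+4 (1)
MII Status: up

802.3ad info
LACP rate: slow
Aggregator selection policy (ad_select): stable

Slave Interface: enp2s0f0
MII Status: up
Speed: 10000 Mbps
Duplex: full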
The command output also displays each member (slave) link's information: its current driver, MAC address, up/down status, etc. You may notice that the slave interfaces are identified by PCI bus number (0000:02:00.0 and 0000:02:00.1) instead of by interface name, as they would be with a linux bond. Again, the reason is that interface names are created by the linux kernel, which is "bypassed" in dpdk.
TODO: add some explanation about drivers.
dpdk_nic_bind: TODO
(vrouter-agent-dpdk)[root@a7s3 /]$ /opt/contrail/bin/dpdk_nic_bind.py --status
Network devices using DPDK-compatible driver
============================================
0000:02:00.0 '82599ES 10-Gigabit SFI/SFP+ Network Connection' drv=uio_pci_generic unused=ixgbe
0000:02:00.1 '82599ES 10-Gigabit SFI/SFP+ Network Connection' drv=uio_pci_generic unused=ixgbe

Network devices using kernel driver
===================================
0000:04:00.0 'I350 Gigabit Network Connection' if=eno1 drv=igb unused=uio_pci_generic *Active*
0000:04:00.1 'I350 Gigabit Network Connection' if=eno2 drv=igb unused=uio_pci_generic

Other network devices
=====================
<none>
Since LACP is running, the LACP parameters are displayed for each member link. Another way to show this information is with the -l|--lacp option:
[root@a7s3 ~]# contrail-tools dpdkinfo -l all
LACP Rate: slow
Fast periodic (ms): 900
Slow periodic (ms): 29000
Short timeout (ms): 3000
Long timeout (ms): 90000
Aggregate wait timeout (ms): 2000
Tx period (ms): 500
Update timeout (ms): 100
Rx marker period (ms): 2000
Slave Interface(0): 0000:02:00.0
Details actor lacp pdu:
port state: 61 (ACT AGG SYNC COL DIST )
Details partner lacp pdu:
port state: 63 (ACT TIMEOUT AGG SYNC COL DIST )
Slave Interface(1): 0000:02:00.1
Details actor lacp pdu:
port state: 61 (ACT AGG SYNC COL DIST )
Details partner lacp pdu:
port state: 63 (ACT TIMEOUT AGG SYNC COL DIST )
LACP Packet Statistics:
Tx Rx
0000:02:00.0 13414 413
0000:02:00.1 13414 414
Here you can get more insight into the LACP running status, including all the LACP timers and PDU statistics on the number of packets exchanged with the peer device. Of course, these counters cover LACP PDUs only. If we need all the packets received and sent through the bond interface, we can use the -n|--stats option.
bond packet counters
The -n|--stats option is useful to look into the packet statistics of the bond interface. So far we've seen at least 2 ways of retrieving packet counters from a vif interface:
-
vif --get X
-
dpdkvifstats.py -v X
The DPDK bond interface is represented by the vRouter interface vif0/0, so you may think setting X to 0 in the above commands achieves the same effect. The problem is that none of these tools print packet statistics for each member link of the bond. Let's take a look at an example:
[root@a7s3 ~]# contrail-tools dpdkinfo --stats eth
Master Info:
RX Device Packets:28360664, Bytes:3233321316, Errors:0, Nombufs:0
Dropped RX Packets:0
TX Device Packets:28361174, Bytes:3234763122, Errors:0
Queue Rx: [0]28360664
Tx: [0]28361174
Rx Bytes: [0]3233321316
Tx Bytes: [0]3234760294
Errors:
---------------------------------------------------------------------
Slave Info(0000:02:00.0):
RX Device Packets:1421, Bytes:129257, Errors:0, Nombufs:0
Dropped RX Packets:0
TX Device Packets:28358167, Bytes:3234235595, Errors:0
Queue Rx: [0]1421
Tx: [0]28358167
Rx Bytes: [0]129257
Tx Bytes: [0]3234232767
Errors:
---------------------------------------------------------------------
Slave Info(0000:02:00.1):
RX Device Packets:28359275, Bytes:3233195707, Errors:0, Nombufs:0
Dropped RX Packets:0
TX Device Packets:3039, Bytes:531175, Errors:0
Queue Rx: [0]28359275
Tx: [0]3039
Rx Bytes: [0]3233195707
Tx Bytes: [0]531175
Errors:
---------------------------------------------------------------------
With the --stats eth option, dpdkinfo prints the traffic distribution among all member links of the DPDK bond interface. In this example, the first member link (PCI bus 0000:02:00.0) received 1421 packets, while the second member link (PCI bus 0000:02:00.1) received 28359275 packets. It is obvious that the second member link carries most of the traffic. Maybe you are wondering why we ended up with such an imbalanced traffic distribution, since we mentioned earlier that the Transmit Hash Policy is set to load balance across member links. The reason is that in this test environment we are sending just one UDP flow!
With more flows we'll see the balancing happen. Let's send more flows, but before that, let's clear the current counters to make the second comparison easier:
[root@a7s3 ~]# contrail-tools vif --clear
Vif stats cleared successfully on all cores for all interfaces
Now we start the rapid script to send 64 flows, and check the same dpdkinfo command output again:
[root@a7s3 ~]# contrail-tools dpdkinfo -n eth
Master Info:
RX Device Packets:471211, Bytes:53724144, Errors:0, Nombufs:0
Dropped RX Packets:0
TX Device Packets:471189, Bytes:53719798, Errors:0
Queue Rx: [0]471211
Tx: [0]471190
Rx Bytes: [0]53724144
Tx Bytes: [0]53719884
Errors:
---------------------------------------------------------------------
Slave Info(0000:02:00.0):
RX Device Packets:228370, Bytes:26033818, Errors:0, Nombufs:0
Dropped RX Packets:0
TX Device Packets:220073, Bytes:25090326, Errors:0
Queue Rx: [0]228370
Tx: [0]220076
Rx Bytes: [0]26033818
Tx Bytes: [0]25090640
Errors:
---------------------------------------------------------------------
Slave Info(0000:02:00.1):
RX Device Packets:242872, Bytes:27693860, Errors:0, Nombufs:0
Dropped RX Packets:0
TX Device Packets:251148, Bytes:28633120, Errors:0
Queue Rx: [0]242872
Tx: [0]251158
Rx Bytes: [0]27693860
Tx Bytes: [0]28634260
Errors:
---------------------------------------------------------------------
From the member link packet statistics, we can be sure the traffic gets balanced over both links.
Now you understand that the -n|--stats option provides insight into member link usage through a few RX/TX counters; based on this information we can determine the load balancing status of a DPDK bond interface. So far all of the packet counters we've seen, whether under the master or the members, are almost the same ones as provided by the vif command. In practice, if you need more extensive statistics, there is another option, -x|--xstats. Let's check it out:
[root@a7s3 ~]# contrail-tools dpdkinfo -xall | grep -v ": 0"
Master Info:
Rx Packets: Rx Bytes:
rx_good_packets: 852475379 rx_good_bytes: 97185979648
rx_q0packets: 852475379 rx_q0bytes: 97185979648
Tx Packets: Tx Bytes:
tx_good_packets: 852853117 tx_good_bytes: 97253818091
tx_q0packets: 852853127 tx_q0bytes: 97253769503
Errors:
Others:
------------------------------------------------------------------
Slave Info(0):0000:02:00.0 Slave Info(1):0000:02:00.1
Rx Packets: Rx Packets:
rx_good_packets: 412875343 rx_good_packets: 439600104
rx_q0packets: 412875343 rx_q0packets: 439600104
rx_size_64_packets: 5939 rx_size_64_packets: 19
rx_size_65_to_127_packets: 412869003 rx_size_65_to_127_packets: 439553375
rx_size_128_to_255_packets: 191 rx_size_128_to_255_packets: 42367
rx_size_256_to_511_packets: 206 rx_size_256_to_511_packets: 1173
rx_broadcast_packets: 5882 rx_size_512_to_1023_packets: 1242
rx_multicast_packets: 6124 rx_size_1024_to_max_packets: 1922
rx_total_packets: 412875340 rx_multicast_packets: 396
Tx Packets: rx_total_packets: 439600098
tx_good_packets: 399807799 Tx Packets:
tx_q0packets: 399807802 tx_good_packets: 453045397
tx_total_packets: 399807792 tx_q0packets: 453045399
tx_size_64_packets: 3552 tx_total_packets: 453045389
tx_size_65_to_127_packets: 399717757 tx_size_65_to_127_packets: 453035768
tx_size_128_to_255_packets: 59597 tx_size_128_to_255_packets: 6448
tx_size_256_to_511_packets: 10695 tx_size_256_to_511_packets: 9
tx_size_512_to_1023_packets: 831 tx_size_512_to_1023_packets: 1680
tx_size_1024_to_max_packets: 15360 tx_size_1024_to_max_packets: 1484
tx_multicast_packets: 6365 tx_multicast_packets: 6365
tx_broadcast_packets: 2941 Rx Bytes:
Rx Bytes: rx_good_bytes: 50119065424
rx_good_bytes: 47066921976 rx_q0bytes: 50119065424
rx_q0bytes: 47066921976 rx_total_bytes: 50119064740
rx_total_bytes: 47066921752 Tx Bytes:
Tx Bytes: tx_good_bytes: 51649995369
tx_good_bytes: 45603831138 tx_q0bytes: 51649996187
tx_q0bytes: 45603781752 Errors:
Errors: Others:
Others: rx_l3_l4_xsum_error: 439588641
rx_l3_l4_xsum_error: 412856784 out_pkts_untagged: 474447816
out_pkts_untagged: 549754060
------------------------------------------------------------------
As you can see, the output is very extensive - perhaps ten times more than what vif, dpdkvifstats.py and dpdkinfo -n eth give. In fact, to shorten the output, we've removed all counters with a zero value and edited the output format to compact the text into two columns. If you go through it quickly, you will be able to tell that the majority of the traffic is composed of packets with sizes between 65 and 127 bytes, which is what we are sending from the rapid script. Increasing the traffic packet size in rapid ends up with a different result:
[root@a7s3 ~]# contrail-tools dpdkinfo -xall | grep -v ": 0"
Master Info:
....
--------------------------------------------------------------------
Slave Info(0):0000:02:00.0 Slave Info(1):0000:02:00.1
Rx Packets: Rx Packets:
rx_good_packets: 7902180 rx_good_packets: 7896450
rx_q0packets: 7902180 rx_q0packets: 7896450
rx_size_64_packets: 302 rx_size_64_packets: 1
rx_size_65_to_127_packets: 1731 rx_size_65_to_127_packets: 389
rx_size_128_to_255_packets: 7900126 rx_size_128_to_255_packets: 7895820
rx_size_256_to_511_packets: 15 rx_size_256_to_511_packets: 66
rx_size_512_to_1023_packets: 3 rx_size_512_to_1023_packets: 69
rx_size_1024_to_max_packets: 3 rx_size_1024_to_max_packets: 105
rx_broadcast_packets: 299 rx_multicast_packets: 20
rx_multicast_packets: 312 rx_total_packets: 7896450
rx_total_packets: 7902180 Tx Packets:
Tx Packets: tx_good_packets: 8272747
tx_good_packets: 7536810 tx_q0packets: 8272747
tx_q0packets: 7536810 tx_total_packets: 8272747
tx_total_packets: 7536810 tx_size_65_to_127_packets: 179
tx_size_64_packets: 181 tx_size_128_to_255_packets: 8272496
tx_size_65_to_127_packets: 290 tx_size_256_to_511_packets: 17
tx_size_128_to_255_packets: 7535143 tx_size_512_to_1023_packets: 53
tx_size_256_to_511_packets: 223 tx_size_1024_to_max_packets: 2
tx_size_512_to_1023_packets: 90 tx_multicast_packets: 324
tx_size_1024_to_max_packets: 883 Rx Bytes:
tx_multicast_packets: 323 rx_good_bytes: 1405706413
tx_broadcast_packets: 150 rx_q0bytes: 1405706413
Rx Bytes: rx_total_bytes: 1405706413
rx_good_bytes: 1406393359 Tx Bytes:
rx_q0bytes: 1406393359 tx_good_bytes: 1472542701
rx_total_bytes: 1406393359 tx_q0bytes: 1472542701
Tx Bytes: Errors:
tx_good_bytes: 1342701308 Others:
tx_q0bytes: 1342698774 rx_l3_l4_xsum_error: 7895846
Errors: out_pkts_untagged: 3532820029
Others:
rx_l3_l4_xsum_error: 7901213
out_pkts_untagged: 3249154601
--------------------------------------------------------------------
We won't discuss all the counters listed in this output; for now, just add dpdkinfo with these two options, -n|--stats and -x|--xstats, to your DPDK vRouter troubleshooting toolkit. Consider using them to collect information whenever you run into traffic loss issues during lab tests or in a production deployment.
Next we'll explore another interesting option: -c|--lcore.
lcore
There are several key concepts we've been trying to illustrate in this book. Among others, three of them are often mentioned together: lcore, interface and queue. Before introducing the -c|--lcore option, let's briefly review these concepts.
lcore-
an lcore is a thread of the vRouter DPDK process running in user space.
interface-
interfaces are the endpoints of the connections between vRouter and the VMs, or between vRouter and the outside of the compute. At the vRouter and VM ends, the interfaces are called vif and tap interfaces respectively. There are also the bond0 physical interface in DPDK user space and the vhost0 interface in the linux kernel. The former is the physical NIC bundle connecting to the peer device, and the latter gives the host an IP address through which the vRouter agent can exchange control plane messages with the controller.
queue-
for each interface a number of queues are created. They are essentially memory buffers allocated to hold the packets.
The CPU cores tie all these objects together. As of the writing of this book, the implementation uses a one-to-one mapping between the number of CPU cores allocated to vRouter and the number of interface queues. For example, if 4 CPUs are allocated to the DPDK vRouter forwarding threads (the lcores), then 4 lcores will be created, and 4 DPDK queues will be created for each vif interface. The same rule applies to the VM: if you assign 4 CPU cores to a VM, then by default Nova will create 4 virtio queues for the VM's tap interface. That said, of course, multi-queue as a feature needs to be turned on in Nova in the first place (see the reminder after the table below). We can illustrate this with the table below:
| vif | queue | lcore | queue | tap(vNIC) |
|---|---|---|---|---|
| 0/3 | 0 | 0 | 0 | tap003 |
| | 1 | 1 | 1 | |
| | 2 | 2 | 2 | |
| | 3 | 3 | 3 | |
| 0/4 | 0 | 0 | 0 | tap004 |
| | 1 | 1 | 1 | |
| | 2 | 2 | 2 | |
| | 3 | 3 | 3 | |
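As a reminder, multi-queue was turned on in our setup through the image property set during installation (the same command shown in the heat automation section):
openstack image set --property hw_vif_multiqueue_enabled="true" rapidVM-1908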
This is just a simple example. In a production deployment there are a lot more conditions to consider, and a lot of confusion arises. Common questions are:
-
What if the tap interface queue number is different from the vif queue number? What will happen when we have 8 lcores, but one of our VMs runs 4 queues in its tap interface?
-
Will vif0/3 queue0 always be served by lcore0, instead of other lcores? If not, how do we determine which vif queue goes to which lcore? Is there a chance that an imbalanced lcore-to-queue mapping happens, so that some lcores are overloaded while others are relatively idle?
To answer these questions, we need a tool to reveal the "secret" of the actual mapping between lcores and the queues of the different vif interfaces. This is the moment for the -c|--lcore option of dpdkinfo to show its power. Again, let's start with an example:
[root@a7s3 ~]# contrail-tools dpdkinfo -c
No. of forwarding lcores: 2
No. of interfaces: 4
Lcore 0:
Interface: bond0.101 Queue ID: 0
Interface: vhost0 Queue ID: 0
Lcore 1:
Interface: bond0.101 Queue ID: 1
Interface: tap41a9ab05-64 Queue ID: 0
Let's start with the first line. In this example, we have allocated two CPU cores to the DPDK vRouter forwarding lcores, so we have 2 forwarding lcores running in total.
Then, the second line gives the number of vRouter interfaces in the compute. We have 4 of them in total: one vif0/3 connecting to the VM tap interface tap41a9ab05-64, and the three mandatory vif0/0, vif0/1 and vif0/2, connecting to bond, vhost0 and pkt0 respectively. Here, we have created just one VM (actually this is nothing but the PROX gen VM we created earlier) with only one tap interface.
From the third line onward is what we'll focus on now. The output lists all the forwarding lcores currently configured in vRouter, and for each lcore it lists the interfaces that the lcore is associated with - in other words, the interfaces this core is "serving".
Please note that there are some inconsistencies in terms of lcore numbering between the different tools:
-
In the dpdkvifstats.py script, forwarding lcore numbering starts from "1", so "Core 1" refers to the first forwarding lcore.
-
In the dpdkinfo -c output, forwarding lcore numbering starts from "0", so "Lcore 0" refers to the first forwarding lcore.
-
In the vif output, forwarding lcore numbering starts from "10", so "--core 10" refers to the first forwarding lcore.
This may cause some confusion in our discussions. To make it consistent, in
the rest of this chapter we’ll use "the first forwarding lcore", fwd lcore#10,
or simply lcore#10; "the second forwarding lcore", fwd lcore#11, or simply
lcore#11; and so on, to indicate "Lcore 0" and "Lcore 1" in dpdkinfo
-c output, "Core 1" and "Core 2" in dpdkvifstats.py script output, and "Core
10" and "Core 11" in vif output, respectively.
| vif | dpdkinfo -c | dpdkvifstats.py | meaning |
|---|---|---|---|
| --core 10 | Lcore 0 | Core 1 | 1st forwarding lcore: lcore#10 |
| --core 11 | Lcore 1 | Core 2 | 2nd forwarding lcore: lcore#11 |
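Because these offsets are easy to mix up, a trivial normalization sketch may help. The function name and tool keys below are our own, for illustration only:

```python
def to_canonical_lcore(tool, n):
    """Normalize the per-tool forwarding lcore numbering to the
    lcore#10-based notation used in this chapter: dpdkinfo -c counts
    from 0, dpdkvifstats.py from 1, and vif from 10."""
    first = {"dpdkinfo": 0, "dpdkvifstats": 1, "vif": 10}
    return "lcore#{}".format(10 + n - first[tool])

print(to_canonical_lcore("dpdkinfo", 0))      # lcore#10
print(to_canonical_lcore("dpdkvifstats", 2))  # lcore#11
print(to_canonical_lcore("vif", 10))          # lcore#10
```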
OK. As you may have realized, in the VM interface we use just one queue, which means the "multiple queue" feature is not enabled on the VM interface. Therefore the VM tap interface has only one queue connecting to its peering vRouter interface. Correspondingly, only one queue is needed in the vRouter interface, and only one lcore is required to serve packet forwarding on the vif interface.
First, let’s look at the bond0 and vhost0 interfaces. bond0 is the
physical interface, and it always has multiple queues enabled; that is
why it has two queues, and both lcores serve it. The vhost0 interface is a
control plane Linux interface. At the time of writing of this book, the
implementation hard-codes vhost0 with one queue only, and the first forwarding
thread lcore#10 got it. This is not the focus of this section but is worth
knowing to understand the whole output.
Finally, let’s look at the last line - the VM tap interface. From the output,
we see it is the second forwarding lcore (lcore#11) that is assigned to this VM
interface. You probably wonder: was it just randomly chosen out of the 2 lcores,
or is some algorithm used? Currently the allocation
follows a simple method: the least used lcore, in terms of the number of
interface queues it is serving, will be assigned to serve the next interface
queue. Based on what we just explained, lcore#10 took two interfaces
(bond0.101 and vhost0) while lcore#11 took just one (bond0.101), so it
is lcore#11’s turn to take the next interface and queue.
|
Important
|
vNIC queues are assigned to logical cores using the following algorithm: the forwarding core that is currently polling the least number of queues is selected, with a tie won by the core with the lowest number (the first forwarding core, lcore#10). |
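As a minimal sketch of this rule (our own illustration, not vRouter's actual code):

```python
def assign_lcore(queues_per_lcore):
    """Pick the forwarding lcore currently polling the fewest queues;
    a tie is won by the lowest-numbered lcore."""
    return min(queues_per_lcore, key=lambda lc: (queues_per_lcore[lc], lc))

# State from the dpdkinfo -c output above:
# lcore#10 polls bond0.101 q0 and vhost0 q0; lcore#11 polls bond0.101 q1.
load = {10: 2, 11: 1}
print(assign_lcore(load))   # 11 -> the new tap queue goes to lcore#11
```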
We’ll see more examples in later sections, where we’ll test out the "tie breaker" and other things. We can convert the above mapping into a table like this:
| vif | queue | lcore | queue | tap(vNIC) |
|---|---|---|---|---|
| 0/0 | 0 | 0 | 0 | bond0 |
| | 1 | 1 | 1 | |
| 0/1 | 0 | 0 | 0 | vhost0 |
| 0/3 | 0 | 1 | 0 | tap41a9ab05-64 |
Now we’ve gone through the dpdkinfo command and demonstrated its most commonly
used options. With this command you can quickly print out a lot of useful
information about the DPDK and DPDK vRouter running status. We’ll review this again
later in our test case studies. This information is important to know before
we work on any deployment or troubleshooting task in the setup. However, when
things go wrong, instead of just relying on the dpdk commands output, you may
also want to check the log messages to verify that the current running status
is what you expected it to be. Next we’ll take a look at the DPDK vRouter log
messages.
dpdk vRouter log files (TODO)
TODO: this is copied and rewritten based on Laurent’s chapter 5. will need to capture in same environment and rewrite again.
Contrail’s DPDK vRouter dataplane log file is named contrail-vrouter-dpdk.log.
Depending on the version or installation method, it can be located in
different folders or even have a totally different name (see the sketch after
this list). For example:
-
in latest TripleO deployment: /var/log/containers/contrail/dpdk/contrail-vrouter-dpdk.log
-
in latest ansible deployment: /var/log/contrail/contrail-vrouter-dpdk.log
-
in older 3.x ubuntu deployment: /var/log/contrail.log
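If you script your troubleshooting, a small probe over these known locations can save time. A sketch, assuming only the paths listed above:

```python
import os

# Known per-deployment locations listed above.
CANDIDATES = [
    "/var/log/containers/contrail/dpdk/contrail-vrouter-dpdk.log",  # TripleO
    "/var/log/contrail/contrail-vrouter-dpdk.log",                  # ansible
    "/var/log/contrail.log",                                        # older 3.x ubuntu
]

def find_vrouter_log():
    """Return the first DPDK vRouter log file that exists, or None."""
    for path in CANDIDATES:
        if os.path.exists(path):
            return path
    return None

print(find_vrouter_log())
```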
This log file contains lots of good information that is helpful for understanding the current running status. Understanding the log messages is important during the troubleshooting process.
DPDK vrouter parameters
Each time the vrouter is started, the main configuration parameters are listed in the log file during the vrouter initialization stage. We can also see the DPDK library version that was used to build the DPDK vrouter binary program.
Here is an example:
2020-09-15 20:27:22,381 VROUTER: vRouter version: {"build-info":
[{"build-time": "2020-09-15 01:07:25.101398", "build-hostname":
"contrail-build-r2008-rhel-115-generic-20200914170527.novalocal", "build-user":
"contrail-builder", "build-version": "2008"}]}
2020-09-15 20:27:22,382 VROUTER: DPDK version: DPDK 19.11.0
2020-09-15 20:27:23,046 VROUTER: Log file : /var/log/contrail/contrail-vrouter-dpdk.log
2020-09-15 20:27:23,046 VROUTER: Bridge Table limit: 262144
2020-09-15 20:27:23,046 VROUTER: Bridge Table overflow limit: 53248
2020-09-15 20:27:23,046 VROUTER: Flow Table limit: 524288
2020-09-15 20:27:23,046 VROUTER: Flow Table overflow limit: 105472
2020-09-15 20:27:23,046 VROUTER: MPLS labels limit: 5120
2020-09-15 20:27:23,046 VROUTER: Nexthops limit: 32768
2020-09-15 20:27:23,046 VROUTER: VRF tables limit: 4096
2020-09-15 20:27:23,046 VROUTER: Packet pool size: 16384
2020-09-15 20:27:23,046 VROUTER: PMD Tx Descriptor size: 128
2020-09-15 20:27:23,046 VROUTER: PMD Rx Descriptor size: 128
2020-09-15 20:27:23,046 VROUTER: Maximum packet size: 9216
2020-09-15 20:27:23,046 VROUTER: Maximum log buffer size: 200
2020-09-15 20:27:23,046 VROUTER: VR_DPDK_RX_RING_SZ: 2048
2020-09-15 20:27:23,046 VROUTER: VR_DPDK_TX_RING_SZ: 2048
2020-09-15 20:27:23,046 VROUTER: VR_DPDK_YIELD_OPTION: 0
2020-09-15 20:27:23,046 VROUTER: VR_SERVICE_CORE_MASK: 0x10
2020-09-15 20:27:23,046 VROUTER: VR_DPDK_CTRL_THREAD_MASK: 0x10
2020-09-15 20:27:23,046 VROUTER: Unconditional Close Flow on TCP RST: 0
2020-09-15 20:27:23,046 VROUTER: EAL arguments:
2020-09-15 20:27:23,046 VROUTER: -n "4"
2020-09-15 20:27:23,046 VROUTER: --socket-mem "1024"
Here we see a complete list of startup parameters of this Contrail vRouter, for example:
-
the build-version is "2008"
-
it is running DPDK version 19.11.0
-
the Nexthops limit parameter is configured as 32768 - decreased from the default value (65536)
-
CPU core #4 is pinned to the control and service threads (VR_SERVICE_CORE_MASK: 0x10)
We can compare this information with what these command line tools print and see if they are consistent:
-
contrail-version
-
dpdkinfo -v
-
vrouter --info
-
taskset
Any inconsistency will provide a clue to proceed in that area.
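To make such a comparison easier to script, the startup parameters can be scraped from the log. A rough sketch; the helper and the regular expression are our own and may need adjusting for other versions:

```python
import re

def parse_vrouter_params(log_path):
    """Extract 'Key: value' startup parameters from a
    contrail-vrouter-dpdk.log, so they can be diffed against what
    contrail-version, dpdkinfo -v and vrouter --info report."""
    params = {}
    pat = re.compile(r"VROUTER: (.+?)\s*: (.+)$")
    with open(log_path) as f:
        for line in f:
            m = pat.search(line)
            if m:
                params[m.group(1)] = m.group(2)
    return params

# e.g. params["Nexthops limit"] == "32768",
#      params["VR_SERVICE_CORE_MASK"] == "0x10"
```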
Polling core allocation
In chapter 3 we introduced that the DPDK vRouter process is a multi-threaded application whose threads fall into different categories based on their roles. This is also reflected by some log entries. Before we dive into the logs, let’s do a quick review of the three thread categories:
- Control threads
-
They are generated by DPDK libraries and are used during Contrail vRouter startup for DPDK initialization. control threads are not our focus in this book.
- Service threads
-
There is a hard-coded set of service threads named
lcore0 through lcore9. Each lcore has its own role; for example, lcore9 serves the netlink connection between the agent and the vRouter data plane. The details of each lcore’s role are out of this book’s scope. We just need to know they are used to serve the communication between the vrouter agent and the vrouter forwarding plane. - Forwarding threads
-
After the service threads, from
lcore10 onward, the forwarding threads are the main horsepower that performs the packet forwarding tasks and determines the performance of the DPDK vRouter. This is the main focus of our book.
|
Note
|
In service threads, lcore3 to lcore7 are never used in the Contrail DPDK vRouter.
|
OK. Now let’s take a look at an interesting log entry:
2020-09-16 09:06:50,886 VROUTER: --lcores "(0-2)@(10,34),(8-9)@(10,34),10@2,11@4,12@6,13@8"
Here, each entry in the --lcores string describes one or more service or
forwarding threads (lcores). The string consists of coupled numbers connected by
@ - "LCORES@CPUS" - separated by commas. How do we decode these?
Well, to understand this we need to understand CPU pinning. To achieve maximum
performance, we pin the service and forwarding threads (or lcores) each
to a few specific CPU cores, so each thread will be served by dedicated CPUs
that are isolated from any other system tasks. So this log reads:
-
Service threads, that is lcore0 to lcore2 and lcore8-lcore9 in the message, are all pinned to two CPU cores: CPU core#10 and CPU core#34. This pinning is configured by the
SERVICE_CORE_MASK parameter.
-
Forwarding threads, lcore10 to lcore13, are pinned to CPU core#2, core#4, core#6 and core#8, respectively. This is configured by the
CPU_LIST parameter. A small parser sketch for this string follows the list.
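For illustration, here is a small parser sketch for this EAL --lcores syntax (our own helper, not DPDK code):

```python
import re

def parse_lcores(spec):
    """Parse a DPDK --lcores mapping string like
    "(0-2)@(10,34),(8-9)@(10,34),10@2,11@4,12@6,13@8"
    into {lcore_id: [cpu, ...]}."""
    def expand(group):
        # "(0-2)", "(10,34)" or "7" -> list of ints
        out = []
        for item in group.strip("()").split(","):
            if "-" in item:
                lo, hi = map(int, item.split("-"))
                out.extend(range(lo, hi + 1))
            else:
                out.append(int(item))
        return out

    mapping = {}
    # split on commas that are not inside parentheses
    for part in re.split(r",(?![^()]*\))", spec):
        lcores, cpus = part.split("@")
        for lc in expand(lcores):
            mapping[lc] = expand(cpus)
    return mapping

print(parse_lcores("(0-2)@(10,34),(8-9)@(10,34),10@2,11@4,12@6,13@8"))
# {0: [10, 34], 1: [10, 34], 2: [10, 34], 8: [10, 34], 9: [10, 34],
#  10: [2], 11: [4], 12: [6], 13: [8]}
```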
Internal Load Balancing
In some situations, the polling core performs a new hash calculation to distribute the polled packets to another processing core. This is the DPDK "pipeline model" implemented in the vrouter.
This distribution behavior can be observed in the following messages in DPDK log file:
2020-01-07 13:08:01,403 VROUTER: Lcore 10: distributing MPLSoGRE packets to [11,12,13]
2020-01-07 13:08:01,403 VROUTER: Lcore 11: distributing MPLSoGRE packets to [10,12,13]
2020-01-07 13:08:01,403 VROUTER: Lcore 12: distributing MPLSoGRE packets to [10,11,13]
2020-01-07 13:08:01,404 VROUTER: Lcore 13: distributing MPLSoGRE packets to [10,11,12]
Here the logs show MPLSoGRE, but the behavior actually applies to both MPLSoGRE
and VxLAN packets. Historically, only MPLSoGRE was supported, and the message
has remained like that in the software code. It means both MPLSoGRE and
VxLAN packets will be distributed via hashing by the polling core.
Virtual Interface queues
Each time a new virtual interface is connected to the vrouter, a vif port is
created on the vrouter with the same number of queues as the number of polling
CPUs (specified in the CPU_LIST parameter). Each queue created is handled by only
one of the vrouter polling cores. So, for each vif, we have a one-to-one mapping
between vrouter polling cores and RX queues. This mapping can be seen in the
dpdkinfo -c command output which we’ve introduced. The same can be observed
in the DPDK vrouter logs:
2019-09-24 16:36:50,011 VROUTER: Adding vif 8 (gen. 37) virtual device tap66e68bc1-a9 ....
2019-09-24 16:36:50,012 VROUTER: lcore 12 RX from HW queue 0
2019-09-24 16:36:50,012 VROUTER: lcore 13 RX from HW queue 1
2019-09-24 16:36:50,012 VROUTER: lcore 10 RX from HW queue 2
2019-09-24 16:36:50,012 VROUTER: lcore 11 RX from HW queue 3
Here the vif interface 0/8 is created in order to connect the virtual NIC tap66e68bc1-a9 to the vrouter. Because 4 forwarding lcores are configured, this vif is created with 4 queues, namely q0 to q3, which are handled by polling cores 12, 13, 10 and 11 respectively.
When a polling queue is enabled on the vrouter, a ring activation message is generated in the Contrail DPDK log file.
The vrings correspond to both transmit and receive queues:
-
the transmit queues are the even numbers. Divide them by 2 to get the queue number. i.e. vring 0 is TX queue 0, vring 2 is TX queue 1, …
-
the receive queues are the odd numbers. Divide them by 2 (discarding the remainder) to get the queue number, i.e. vring 1 is RX queue 0, vring 3 is RX queue 1, … A tiny sketch of this arithmetic follows the list.
-
ready state 1 = enabled. ready state 0 = disabled
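The even/odd rule above is easy to encode; a tiny sketch:

```python
def vring_to_queue(vring):
    """Map a vring index to (direction, queue number): even vrings are
    TX queues, odd vrings are RX queues, queue number = vring // 2."""
    return ("TX" if vring % 2 == 0 else "RX"), vring // 2

for v in range(4):
    print(v, vring_to_queue(v))
# 0 ('TX', 0)
# 1 ('RX', 0)
# 2 ('TX', 1)
# 3 ('RX', 1)
```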
In the example below, only 1 RX (and TX) queue is enabled on the vrouter vif interface. A single queue virtual machine interface is connected to the vrouter port:
2019-09-24 16:37:46,693 UVHOST: Client _tap66e68bc1-a9: setting vring 0 ready state 1
2019-09-24 16:37:46,693 UVHOST: Client _tap66e68bc1-a9: setting vring 1 ready state 1
2019-09-24 16:37:46,693 UVHOST: Client _tap66e68bc1-a9: setting vring 2 ready state 0
2019-09-24 16:37:46,693 UVHOST: Client _tap66e68bc1-a9: setting vring 3 ready state 0
2019-09-24 16:37:46,693 UVHOST: Client _tap66e68bc1-a9: setting vring 4 ready state 0
2019-09-24 16:37:46,693 UVHOST: Client _tap66e68bc1-a9: setting vring 5 ready state 0
2019-09-24 16:37:46,693 UVHOST: Client _tap66e68bc1-a9: setting vring 6 ready state 0
2019-09-24 16:37:46,693 UVHOST: Client _tap66e68bc1-a9: setting vring 7 ready state 0
In the example below, 4 RX (and TX) queues are enabled on the vrouter vif interface, but a virtual machine interface having more than 4 queues is connected to the vrouter port:
2019-09-24 16:37:46,693 UVHOST: Client _tap66e68bc1-a9: setting vring 0 ready state 1
2019-09-24 16:37:46,693 UVHOST: Client _tap66e68bc1-a9: setting vring 1 ready state 1
2019-09-24 16:37:46,693 UVHOST: Client _tap66e68bc1-a9: setting vring 2 ready state 1
2019-09-24 16:37:46,693 UVHOST: Client _tap66e68bc1-a9: setting vring 3 ready state 1
2019-09-24 16:37:46,693 UVHOST: Client _tap66e68bc1-a9: setting vring 4 ready state 1
2019-09-24 16:37:46,693 UVHOST: Client _tap66e68bc1-a9: setting vring 5 ready state 1
2019-09-24 16:37:46,693 UVHOST: Client _tap66e68bc1-a9: setting vring 6 ready state 1
2019-09-24 16:37:46,693 UVHOST: Client _tap66e68bc1-a9: setting vring 7 ready state 1
2019-09-24 16:37:46,693 UVHOST: vr_uvhm_set_vring_enable: Can not disable TX queue 4 (only 4 queues)
2019-09-24 16:37:46,693 UVHOST: Client _tap66e68bc1-a9: handling message 18
2019-09-24 16:37:46,693 UVHOST: vr_uvhm_set_vring_enable: Can not disable RX queue 4 (only 4 queues)
As there are more than 4 queues on the virtual machine interface, the extra queues should not be enabled on the virtual machine NIC. Unfortunately, these queues can’t be disabled from the virtual machine. Therefore, this setup is faulty.
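This faulty condition is easy to test for in a health-check script; a sketch with our own hypothetical helper name:

```python
def check_vnic_queues(vm_queues, fwd_lcores):
    """Hypothetical sanity check for the situation described above:
    a VM vNIC should not carry more queues than there are vRouter
    forwarding lcores, since the extra queues cannot be served
    or disabled."""
    if vm_queues > fwd_lcores:
        return ("FAULTY: {} vNIC queues but only {} forwarding lcores"
                .format(vm_queues, fwd_lcores))
    return "OK"

print(check_vnic_queues(8, 4))  # FAULTY: 8 vNIC queues but only 4 forwarding lcores
```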
dpdk vRouter case studies
In previous sections, we’ve introduced some DPDK tools and explained some important log entries that help collect the DPDK vRouter running status.
single queue
Having understood the lcore mapping basics, let’s start a test with some traffic flowing.
one way single flow: VM to fabric
To make it very simple, we are sending a single uni-directional UDP flow from the PROX gen VM. We can list the current flows in vRouter to confirm this.
[root@a7s3 ~]# contrail-tools flow -l
Flow table(size 161218560, entries 629760)
......
Index Source:Port/Destination:Port Proto(V)
-----------------------------------------------------------------------------------
40196<=>436016 192.168.0.106:59514 6 (3)
192.168.0.104:22
(Gen: 1, K(nh):27, Action:F, Flags:, TCP:SSrEEr, QOS:-1, S(nh):36, Stats:503/35823,
SPort 56703, TTL 0, Sinfo 8.0.0.3)
436016<=>40196 192.168.0.104:22 6 (3)
192.168.0.106:59514
(Gen: 1, K(nh):27, Action:F, Flags:, TCP:SSrEEr, QOS:-1, S(nh):27, Stats:511/71619,
SPort 49812, TTL 0, Sinfo 4.0.0.0)
62792<=>172020 192.168.0.106:48664 6 (3)
192.168.0.104:8474
(Gen: 1, K(nh):27, Action:F, Flags:, TCP:SSrEEr, QOS:-1, S(nh):36, Stats:3828/296117,
SPort 63470, TTL 0, Sinfo 8.0.0.3)
172020<=>62792 192.168.0.104:8474 6 (3)
192.168.0.106:48664
(Gen: 1, K(nh):27, Action:F, Flags:, TCP:SSrEEr, QOS:-1, S(nh):27, Stats:2739/274615,
SPort 52648, TTL 0, Sinfo 4.0.0.0)
38232<=>257372 192.168.1.105:32768 17 (2)
192.168.1.104:32770
(Gen: 5, K(nh):30, Action:F, Flags:, QOS:-1, S(nh):37, Stats:0/0, SPort 61739,
TTL 0, Sinfo 0.0.0.0)
257372<=>38232 192.168.1.104:32770 17 (2)
192.168.1.105:32768
(Gen: 5, K(nh):30, Action:F, Flags:, QOS:-1, S(nh):30, Stats:390003/48360372,
SPort 62464, TTL 0, Sinfo 3.0.0.0)
Here, we see 6 vRouter flows, which are in fact 3 pairs. The first 2 pairs,
with index pairs 40196/436016 and 62792/172020, are generated by the
control messages from the rapid "jump" VM to the PROX gen VM. The last pair of
flows, with index pair 38232/257372, is our single-flow test traffic. The
stats 390003/48360372 show the traffic flow is sent from the gen VM
(192.168.1.104:32770) to the swap VM (192.168.1.105:32768).
|
Note
|
In Contrail vRouter, flows are generated in pairs. For any traffic, even if it is one direction only, vRouter will generate a "reverse flow" for it. This is because in the real world most traffic is bidirectional, so a separate entry built for each direction is required. In our case, we are generating uni-directional traffic from PROX, so only the flow of that direction has packet statistics. Its paired flow entry is generated as well, but its packet statistics show nothing. |
Let’s clear vif counters, and collect the statistics using dpdkvifstats.py
tool:
[root@a7s3 ~]# contrail-tools vif --clear
Vif stats cleared successfully on all cores for all interfaces
[root@a7s3 ~]# contrail-tools dpdkvifstats.py -v 3 -c 2
------------------------------------------------------------------
| Core 1 | TX pps: 0 | RX pps: 1504 | TX bps: 0 | RX bps: 90240
| Core 2 | TX pps: 1 | RX pps: 1 | TX bps: 42 | RX bps: 56
| Total | TX pps: 1 | RX pps: 1505 | TX bps: 336 | RX bps: 722368
------------------------------------------------------------------
[root@a7s3 ~]# contrail-tools dpdkvifstats.py -v 0 -c 2
--------------------------------------------------------------------
| Core 1 | TX pps: 1512 | RX pps: 2 | TX bps: 166320 | RX bps: 132
| Core 2 | TX pps: 1 | RX pps: 1 | TX bps: 112 | RX bps: 110
| Total | TX pps: 1513 | RX pps: 3 | TX bps: 1331456 | RX bps: 1936
--------------------------------------------------------------------
From the first capture, on the vRouter interface connecting to the PROX gen VM
tap interface (-v 3), we see that "lcore#10" received the traffic - we
can tell from the RX speed of 1504 pps showing in "Core 1" only. The second
capture, on the vRouter interface toward the bond interface (-v 0), confirms the
same - it is the same lcore#10 ("Core 1" here) that is sending the traffic
to the bond interface, at a speed of 1512 pps, almost the same speed at which it
received the traffic from the VM tap interface. This flow is illustrated here:
VM: tap41a9ab05-64 => vif0/3 => lcore#10 => vif0/0 => bond0
This seems "weird", doesn’t it? Remember, based on the
core-interface mapping given by dpdkinfo -c earlier, we already knew it was
lcore#11 serving our VM interface, not the other one.
Accordingly, in the dpdkvifstats.py output, that should be "Core 2" instead of
"Core 1". Let’s revisit the mapping:
[root@a7s3 ~]# contrail-tools dpdkinfo -c
No. of forwarding lcores: 2
No. of interfaces: 4
Lcore 0:
Interface: bond0.101 Queue ID: 0
Interface: vhost0 Queue ID: 0
Lcore 1:
Interface: bond0.101 Queue ID: 1
Interface: tap41a9ab05-64 Queue ID: 0
So we are right. The flow that is "expected" should be something like this:
VM: tap41a9ab05-64 => vif0/3 => lcore#11 => vif0/0 => bond0
Well, if you remember what you’ve read in chapter 3, you probably know the
answer. When a packet flows from our PROX gen VM to the bond, vRouter uses a
pipeline model to process the packet. What that really means is that the
interface’s serving lcore - the second forwarding lcore in our case,
based on the dpdkinfo -c output - will poll it out of the vif interface. In
chapter 3, when we introduced the vRouter packet forwarding process, we
mentioned that when traffic flows from the vif connecting the VM tap interface to
vif0/0, all packets will be distributed by the "polling lcore" to other lcores
for processing. The distribution is calculated based on a hash of the packet
header.
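To see why this matters with only two lcores, here is a toy sketch. The helper and the CRC32 hash below are our own stand-ins for illustration (vRouter uses its own header hash); the point is only that the polling lcore always hands packets to one of the *other* lcores:

```python
import zlib

def pick_processing_lcore(pkt_tuple, polling_lcore, fwd_lcores):
    """Toy model of the pipeline distribution step: the polling lcore
    hashes the packet header and hands the packet to one of the other
    forwarding lcores for processing."""
    candidates = [lc for lc in fwd_lcores if lc != polling_lcore]
    key = ":".join(map(str, pkt_tuple)).encode()
    return candidates[zlib.crc32(key) % len(candidates)]

# With two forwarding lcores, whatever lcore#11 polls from the VM vif
# can only be handed to lcore#10:
flow = ("192.168.1.104", 32770, "192.168.1.105", 32768, 17)
print(pick_processing_lcore(flow, polling_lcore=11, fwd_lcores=[10, 11]))  # 10
```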
Here the "polling" core, based on the mapping above, is lcore#11,
and the only "other" lcore is the first forwarding lcore, lcore#10.
So packets from the VM get polled by lcore#11 and then distributed to
lcore#10, which then forwards them to the fabric interface vif0/0.
Currently dpdkvifstats.py does not tell much about these details, but if you
collect the vif output, you will see some more clues:
[root@a7s3 ~]# contrail-tools vif --get 3 --core 10
Vrouter Interface Table
......
vif0/3 PMD: tap41a9ab05-64 NH: 38
Type:Virtual HWaddr:00:00:5e:00:01:00 IPaddr:192.168.1.104
Vrf:2 Mcast Vrf:2 Flags:L3L2DEr QOS:-1 Ref:12
RX queue errors to lcore 0 0 0 0 0 0 0 0 0 0 0 0
Core 10 RX packets:31272 bytes:1876320 errors:0
Core 10 TX packets:0 bytes:0 errors:0
Drops:18660668
[root@a7s3 ~]# contrail-tools vif --get 3 --core 11
Vrouter Interface Table
......
vif0/3 PMD: tap41a9ab05-64 NH: 38
Type:Virtual HWaddr:00:00:5e:00:01:00 IPaddr:192.168.1.104
Vrf:2 Mcast Vrf:2 Flags:L3L2DEr QOS:-1 Ref:12
Core 11 RX queue packets:35384 errors:0
RX queue errors to lcore 0 0 0 0 0 0 0 0 0 0 0 0
Core 11 RX packets:26 bytes:1092 errors:0
Core 11 TX packets:24 bytes:1008 errors:0
Drops:18660668
The "RX queue" counter Core 11 RX queue packets:35384 gives a bit of a
clue about this inter-core distribution. Core 11, our second forwarding
lcore, polled the packets first from vif0/3 into its RX queue. Instead of
"processing" the packets, it distributed them to the first forwarding lcore, Core
10, which then "processed" them. That is why the same amount of packets is
counted as RX packets on Core 10. Therefore the full story is a flow like
this:
VM: tap41a9ab05-64 => vif0/3 => lcore#11 (polling lcore) => lcore#10 (processing lcore) => vif0/0 => bond0
For the sake of completeness, we also captured the vif command on fabric interface vif0/0:
[root@a7s3 ~]# contrail-tools vif --get 0 --core 10
Vrouter Interface Table
......
vif0/0 PCI: 0000:00:00.0 (Speed 20000, Duplex 1) NH: 4
Type:Physical HWaddr:90:e2:ba:c3:af:20 IPaddr:0.0.0.0
Vrf:0 Mcast Vrf:65535 Flags:TcL3L2VpVofEr QOS:-1 Ref:18
Core 10 RX device packets:199 bytes:49057 errors:0
RX queue errors to lcore 0 0 0 0 0 0 0 0 0 0 0 0
Fabric Interface: eth_bond_bond0 Status: UP Driver: net_bonding
Slave Interface(0): 0000:02:00.0 Status: UP Driver: net_ixgbe
Slave Interface(1): 0000:02:00.1 Status: UP Driver: net_ixgbe
Vlan Id: 101 VLAN fwd Interface: vfw
Core 10 RX packets:131 bytes:37595 errors:0
Core 10 TX packets:48756 bytes:5362888 errors:0
Drops:0
Core 10 TX device packets:49024 bytes:5730372 errors:0
[root@a7s3 ~]# contrail-tools vif --get 0 --core 11
Vrouter Interface Table
......
vif0/0 PCI: 0000:00:00.0 (Speed 20000, Duplex 1) NH: 4
Type:Physical HWaddr:90:e2:ba:c3:af:20 IPaddr:0.0.0.0
Vrf:0 Mcast Vrf:65535 Flags:TcL3L2VpVofEr QOS:-1 Ref:18
RX queue errors to lcore 0 0 0 0 0 0 0 0 0 0 0 0
Fabric Interface: eth_bond_bond0 Status: UP Driver: net_bonding
Slave Interface(0): 0000:02:00.0 Status: UP Driver: net_ixgbe
Slave Interface(1): 0000:02:00.1 Status: UP Driver: net_ixgbe
Vlan Id: 101 VLAN fwd Interface: vfw
Core 11 RX packets:67 bytes:9860 errors:0
Core 11 TX packets:181 bytes:162062 errors:0
Drops:0
Here, after the first forwarding lcore processed the packets, it
sent them out of vif0/0, which is reflected as TX packets and TX device
packets.
|
Important
|
What we’ve tested and demonstrated is the DPDK vRouter default
behavior with the parameters it currently takes. Please keep in mind that
vRouter is configurable. There is one vRouter configuration option, introduced
in release R2008, which changes this default pipeline model behavior. This
option is --vr_no_load_balance, and we can verify the vrouter-dpdk process
command line in our setup with the ps command. With that option configured,
vRouter changes to the so-called run-to-complete model, which means whichever
lcore polled the packet will continue to process and forward it. This
requires a restart of the DPDK vRouter, and we won’t test this scenario in this book.
|
This concludes the analysis of traffic forwarding in the direction of VM to fabric. Next let’s take a look at the returning direction: from fabric (vif0/0) to VM (vif0/3).
returning traffic: fabric to VM
Now we add returning traffic. We configure the swap VM in such a way that it loops whatever it receives back to the sender. Here is the capture:
[root@a7s3 ~]# contrail-tools dpdkvifstats.py -v 3 -c 2
---------------------------------------------------------------------------------
| Core 1 | TX pps: 0 | RX pps: 85274 | TX bps: 0 | RX bps: 10574058 ..
| Core 2 | TX pps: 85278 | RX pps: 1 | TX bps: 10574431 | RX bps: 56 ..
| Total | TX pps: 85278 | RX pps: 85275 | TX bps: 84595448 | RX bps: 84592912 ..
---------------------------------------------------------------------------------
[root@a7s3 ~]# contrail-tools dpdkvifstats.py -v 0 -c 2
---------------------------------------------------------------------------------
| Core 1 | TX pps: 85844 | RX pps: 16 | TX bps: 14936710 | RX bps: 1940 ..
| Core 2 | TX pps: 1 | RX pps: 85846 | TX bps: 88 | RX bps: 14937132 ..
| Total | TX pps: 85845 | RX pps: 85862 | TX bps: 119494384 | RX bps: 119512576..
---------------------------------------------------------------------------------
Here, we are looking at the returning traffic from fabric back to our PROX gen VM:
fabric: bond0 => vif0/0 (RX) => lcore#? => vif0/3 (TX) => tap41a9ab05-64 => VM
So we focus on the RX counters of vif0/0 and the TX counters of vif0/3, and the data
shows lcore#11 received the packets from vif0/0 and forwarded them out of vif0/3.
To confirm whether this lcore is also the polling lcore, we’ll need to look at the
vif capture:
[root@a7s3 ~]# contrail-tools vif --get 0 --core 10
Vrouter Interface Table
......
vif0/0 PCI: 0000:00:00.0 (Speed 20000, Duplex 1) NH: 4
Type:Physical HWaddr:90:e2:ba:c3:af:20 IPaddr:0.0.0.0
Vrf:0 Mcast Vrf:65535 Flags:TcL3L2VpVofEr QOS:-1 Ref:18
Core 10 RX device packets:3481584 bytes:619708685 errors:0
RX queue errors to lcore 0 0 0 0 0 0 0 0 0 0 0 0
Fabric Interface: eth_bond_bond0 Status: UP Driver: net_bonding
Slave Interface(0): 0000:02:00.0 Status: UP Driver: net_ixgbe
Slave Interface(1): 0000:02:00.1 Status: UP Driver: net_ixgbe
Vlan Id: 101 VLAN fwd Interface: vfw
Core 10 RX packets:676 bytes:106243 errors:0
Core 10 TX packets:3482241 bytes:605899226 errors:0
Drops:99
Core 10 TX device packets:3482474 bytes:619966089 errors:0
[root@a7s3 ~]# contrail-tools vif --get 0 --core 11
Vrouter Interface Table
......
vif0/0 PCI: 0000:00:00.0 (Speed 20000, Duplex 1) NH: 4
Type:Physical HWaddr:90:e2:ba:c3:af:20 IPaddr:0.0.0.0
Vrf:0 Mcast Vrf:65535 Flags:TcL3L2VpVofEr QOS:-1 Ref:18
RX queue errors to lcore 0 0 0 0 0 0 0 0 0 0 0 0
Fabric Interface: eth_bond_bond0 Status: UP Driver: net_bonding
Slave Interface(0): 0000:02:00.0 Status: UP Driver: net_ixgbe
Slave Interface(1): 0000:02:00.1 Status: UP Driver: net_ixgbe
Vlan Id: 101 VLAN fwd Interface: vfw
Core 11 RX packets:3594939 bytes:625517508 errors:0
Core 11 TX packets:166 bytes:133391 errors:0
Drops:99
We do not see any RX queue packets counter like the one we’ve seen in the data
collected in the VM-to-fabric direction. Therefore, in this direction we don’t
see the inter-core load balancing behavior we elaborated on before.
This concludes our analysis of the bidirectional single-flow traffic. As you can see, one benefit of having a traffic generator/swapper built into the lab environment is that we can fine-tune the generator to send traffic in a very specific pattern, so that we can take a deep look at the counters and analyze the vRouter traffic forwarding behavior. This is very helpful for learning purposes. In production, you can rarely expect such a "luxury", since the traffic pattern in the "field" is usually much more complex. But don’t worry: we can add more and more complexity to our traffic pattern, so eventually you will see something close to what you would see in real life.
Next, we’ll add more flows in our testbed and check the result.
multiple flows
Here, we are sending 64 flows from the PROX gen VM. To confirm the flow numbers we
use the flow -s command in contrail-tools:
[root@a7s3 ~]# contrail-tools flow -s
Flow Statistics
---------------
Total Entries --- Total = 132, new = 0
Active Entries --- Total = 132, new = 0
Hold Entries --- Total = 0, new = 0
Fwd flow Entries - Total = 132
drop flow Entries - Total = 0
NAT flow Entries - Total = 0
Rate of change of Active Entries
--------------------------------
current rate = 0
Avg setup rate = 0
Avg teardown rate = 0
Rate of change of Flow Entries
------------------------------
current rate = 0
132 flow entries means 66 flow pairs in our test: our 64 test flows, plus the additional 2 pairs of control flows between the jump VM and the gen VM. Good, let’s collect the traffic statistics.
[root@a7s3 ~]# contrail-tools vif --clear
Vif stats cleared successfully on all cores for all interfaces
[root@a7s3 ~]# contrail-tools dpdkvifstats.py -all -c 2
| VIF 3 | Core 1 | TX pps: 1 | RX pps: 85248 | TX bps: 448 | RX bps: 84566016
| VIF 3 | Core 2 | TX pps: 1 | RX pps: 1 | TX bps: 336 | RX bps: 560
| VIF 0 | Core 1 | TX pps: 85842 | RX pps: 15 | TX bps: 119490528 | RX bps: 14744
| VIF 0 | Core 2 | TX pps: 0 | RX pps: 0 | TX bps: 0 | RX bps: 0
------------------------------------------------------------------------
| pps per Core |
------------------------------------------------------------------------
|Core 1 |TX + RX pps: 171133 | TX pps 85858 | RX pps 85275 |
|Core 2 |TX + RX pps: 2 | TX pps 1 | RX pps 1 |
------------------------------------------------------------------------
|Total |TX + RX pps: 171135 | TX pps 85859 | RX pps 85276 |
------------------------------------------------------------------------
Still, lcore#10 processed the packets and forwarded them out of vif0/0.
If you compare this result with our first test, where we had just one
uni-directional flow, there is simply no difference. Shouldn’t we expect to see
some load balancing between lcores, since we have more flows now? We should, but
only when the VM tap interface has "multiple queues". With just one
queue, the mapping between our tap interface and the lcores never changes. In our
case it’s always lcore#11 polling the traffic and distributing it to
lcore#10, hence we’ll always see packets being forwarded by lcore#10 instead
of lcore#11, regardless of the number of flows or the traffic volume.
In the other direction, once we enable the returning traffic, we’ll see on VIF
0 (vif0/0) that the two lcores' traffic is RX pps: 41547 and RX pps: 44257,
which is well balanced - because we have two queues enabled on vif0/0.
[root@a7s3 ~]# contrail-tools dpdkvifstats.py -all -c 2
| VIF 3 | Core 1 | TX pps: 41249 | RX pps: 85182 | TX bps: 40919336 | RX bps: 84500544
| VIF 3 | Core 2 | TX pps: 43936 | RX pps: 1 | TX bps: 43584072 | RX bps: 336
| VIF 0 | Core 1 | TX pps: 85765 | RX pps: 41547 | TX bps: 119382912 | RX bps: 57825008
| VIF 0 | Core 2 | TX pps: 3 | RX pps: 44257 | TX bps: 18216 | RX bps: 61604304
------------------------------------------------------------------------
| pps per Core |
------------------------------------------------------------------------
|Core 1 |TX + RX pps: 253763 | TX pps 127025 | RX pps 126738 |
|Core 2 |TX + RX pps: 88197 | TX pps 43939 | RX pps 44258 |
------------------------------------------------------------------------
|Total |TX + RX pps: 341960 | TX pps 170964 | RX pps 170996 |
------------------------------------------------------------------------
|
Important
|
With a single queue on the VM tap interface, it’s hard to achieve good load balance between lcores on the vRouter interface facing the virtual machine. Sometimes we need to enable "multiple queue" to make better use of all our DPDK forwarding lcores. |
This concludes our analysis of the single queue test, and we’ll go ahead to test "multiple queues".
multiple queues
Let’s look at a multiple queue example.
Based on the previous setup, this time we added one more queue on the tap interface of the gen VM and then collected the core-interface mapping:
[root@a7s3 ~]# contrail-tools dpdkinfo -c
No. of forwarding lcores: 2
No. of interfaces: 5
Lcore 0:
Interface: bond0.101 Queue ID: 0
Interface: vhost0 Queue ID: 0
Interface: tap41a9ab05-64 Queue ID: 1
Lcore 1:
Interface: bond0.101 Queue ID: 1
Interface: tap41a9ab05-64 Queue ID: 0
Here is the table view of this mapping:
| vif | queue | lcore | queue | tap(vNIC) |
|---|---|---|---|---|
| 0/0 | 0 | 0 | 0 | bond0 |
| | 1 | 1 | 1 | |
| 0/1 | 0 | 0 | 0 | vhost0 |
| 0/3 | 0 | 1 | 0 | tap41a9ab05-64 |
| | 1 | 0 | 1 | |
So most items remain the same, except that we have one more queue added on the
tap interface and on the vRouter interface to which it attaches; correspondingly,
one core is allocated to serve this new queue. Before this new queue was created,
we already knew that each of our lcores was serving the same number of queues.
Therefore, as a "tie breaker" - which we mentioned when we introduced dpdkinfo -c
previously - the first forwarding lcore, lcore#10 in our notation, is
allocated for the new queue. A short sketch of this tie-break follows.
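Reusing the tie-break sketch from earlier in this chapter:

```python
def assign_lcore(queues_per_lcore):
    # Same tie-breaker sketch as before: fewest queues wins,
    # lowest lcore number breaks the tie.
    return min(queues_per_lcore, key=lambda lc: (queues_per_lcore[lc], lc))

# Before the second tap queue is added, both lcores serve 2 queues each:
# lcore#10: bond0.101 q0 + vhost0 q0; lcore#11: bond0.101 q1 + tap q0.
print(assign_lcore({10: 2, 11: 2}))   # 10 -> tap queue 1 lands on lcore#10
```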
Let’s check the traffic distribution between lcores with multiple queues on VM tap interface:
[root@a7s3 ~]# contrail-tools dpdkvifstats.py -all -c 2
| VIF 3 | Core 1 | TX pps: 41319 | RX pps: 42606 | TX bps: 40988672 | RX bps: 42264712
| VIF 3 | Core 2 | TX pps: 43889 | RX pps: 42604 | TX bps: 43537008 | RX bps: 42262288
| VIF 0 | Core 1 | TX pps: 42923 | RX pps: 41540 | TX bps: 59748824 | RX bps: 57815160
| VIF 0 | Core 2 | TX pps: 42918 | RX pps: 44320 | TX bps: 59741640 | RX bps: 61693328
------------------------------------------------------------------------
| pps per Core |
------------------------------------------------------------------------
|Core 1 |TX + RX pps: 168416 | TX pps 84258 | RX pps 84158 |
|Core 2 |TX + RX pps: 173731 | TX pps 86807 | RX pps 86924 |
------------------------------------------------------------------------
|Total |TX + RX pps: 342147 | TX pps 171065 | RX pps 171082 |
------------------------------------------------------------------------
Now, since we have multiple queues on both the VM tap interface and the fabric interface, traffic on all lcores is very well balanced. Please keep this in mind as the "ideal" traffic profile that we expect the vRouter to have. In production, we usually deal with more complicated vRouter lcore configurations and traffic profiles, so the lcore balancing may not appear as perfect as what we are seeing in the lab environment, but at least you have a good baseline in your mind and know what to look for when the result is far worse than expected.
TODO:
-
a "bad case", tap queue num > lcore num
-
make 4 lcores and redo everything, depending on time.